Overview

Principal Component Analysis (PCA) is widely used in data exploration, dimension reduction, and data visualization. The aim is to transform the original variables into uncorrelated linear combinations while retaining as much of the information in the data as possible. High-dimensional data often reveals clusters when viewed in a lower-dimensional projection.
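As a quick, minimal illustration of these two properties (using R's built-in USArrests data, not the case-study data):

```r
# PCA on R's built-in USArrests data; scale. = TRUE standardizes each column
p <- prcomp(USArrests, scale. = TRUE)

# PC scores are uncorrelated linear combinations of the original columns
round(cor(p$x[, 1], p$x[, 2]), 10)  # zero up to numerical precision

# Proportion of variance explained by each PC: the first PC dominates
round(p$sdev^2 / sum(p$sdev^2), 3)
```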

Clustering analysis is another form of EDA. Here we hope to group data points so that they are close to each other within a group and far from points in other groups. Clustering on the principal components can be effective. Clustering analysis can be quite subjective in how we summarize the common properties within each group.
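A minimal k-means sketch (again on the built-in USArrests data, not the case-study data): points are grouped so that the within-cluster sum of squares is small relative to the total.

```r
set.seed(1)  # k-means starts from random centers, so fix the seed
km <- kmeans(scale(USArrests), centers = 3, nstart = 25)
km$size                  # number of states in each cluster
km$betweenss / km$totss  # closer to 1 means better-separated groups
```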

Both PCA and clustering analysis are so-called unsupervised learning methods: no response variable is involved in the process.

For supervised learning, we try to find out how a set of predictors relates to some response variable of interest. Multiple regression is still, by far, one of the most popular methods. We use linear models as working models for their simplicity and interpretability. It is important to use domain knowledge as much as we can to determine the form of the response as well as the functional form of the predictors.
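A sketch of such a working linear model, using the built-in mtcars data rather than the case-study data:

```r
# Linear model: fuel efficiency as a function of weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit)                # slopes: change in mpg per unit of each predictor
summary(fit)$r.squared   # proportion of variation explained by the model
```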

0.1 Objectives

  • PCA
  • SVD
  • Clustering Analysis
  • Linear Regression

0.2 Review materials

  • Study Module 2: PCA
  • Study Module 3: Clustering Analysis
  • Study Module 4: Multiple regression

0.3 Data needed

  • NLSY79.csv
  • brca_subtype.csv
  • brca_x_patient.csv

1 Case study 1: Self-esteem

Self-esteem generally describes a person’s overall sense of self-worth and personal value. It can play a significant role in one’s motivation and success throughout life. Factors that influence self-esteem include inner thinking, health condition, age, life experiences, etc. We will try to identify possible factors in our data that are related to the level of self-esteem.

The well-cited National Longitudinal Survey of Youth (NLSY79) follows about 13,000 individuals, and numerous individual-year variables have been gathered through surveys. The survey data is open to the public. From the many variables available, we assembled a subset including personal demographic variables in different years, household environment in 1979, ASVAB test scores in 1981, and self-esteem scores in 1981 and 1987.

The data is stored in NLSY79.csv.

Here are the descriptions of the variables:

Personal Demographic Variables

  • Gender: a factor with levels “female” and “male”
  • Education05: years of education completed by 2005
  • HeightFeet05, HeightInch05: height measurement. For example, a person of 5’10 will be recorded as HeightFeet05=5, HeightInch05=10.
  • Weight05: weight in lbs.
  • Income87, Income05: total annual income from wages and salary in 1987 and 2005, respectively.
  • Job87, Job05: job type in 1987 and 2005, including Protective Service Occupations, Food Preparation and Serving Related Occupations, Cleaning and Building Service Occupations, Entertainment Attendants and Related Workers, Funeral Related Occupations, Personal Care and Service Workers, Sales and Related Workers, Office and Administrative Support Workers, Farming, Fishing and Forestry Occupations, Construction Trade and Extraction Workers, Installation, Maintenance and Repairs Workers, Production and Operating Workers, Food Preparation Occupations, Setters, Operators and Tenders, Transportation and Material Moving Workers

Household Environment

  • Imagazine: a variable taking on the value 1 if anyone in the respondent’s household regularly read magazines in 1979, otherwise 0
  • Inewspaper: a variable taking on the value 1 if anyone in the respondent’s household regularly read newspapers in 1979, otherwise 0
  • Ilibrary: a variable taking on the value 1 if anyone in the respondent’s household had a library card in 1979, otherwise 0
  • MotherEd: mother’s years of education
  • FatherEd: father’s years of education
  • FamilyIncome78

Variables Related to ASVAB test Scores in 1981

Test Description
AFQT percentile score on the AFQT intelligence test in 1981
Coding score on the Coding Speed test in 1981
Auto score on the Automotive and Shop test in 1981
Mechanic score on the Mechanic test in 1981
Elec score on the Electronics Information test in 1981
Science score on the General Science test in 1981
Math score on the Math test in 1981
Arith score on the Arithmetic Reasoning test in 1981
Word score on the Word Knowledge Test in 1981
Parag score on the Paragraph Comprehension test in 1981
Numer score on the Numerical Operations test in 1981

Self-Esteem test 81 and 87

We have two sets of self-esteem tests, one in 1981 and the other in 1987. Each set has the same 10 questions, labeled Esteem81 and Esteem87 respectively, followed by the question number. For example, Esteem81_1 is esteem question 1 in 1981.

The following 10 questions are answered as 1: strongly agree, 2: agree, 3: disagree, 4: strongly disagree

  • Esteem 1: “I am a person of worth”
  • Esteem 2: “I have a number of good qualities”
  • Esteem 3: “I am inclined to feel like a failure”
  • Esteem 4: “I do things as well as others”
  • Esteem 5: “I do not have much to be proud of”
  • Esteem 6: “I take a positive attitude towards myself and others”
  • Esteem 7: “I am satisfied with myself”
  • Esteem 8: “I wish I could have more respect for myself”
  • Esteem 9: “I feel useless at times”
  • Esteem 10: “I think I am no good at all”

1.1 Data preparation

Load the data. Do a quick EDA to get familiar with the data set. Pay attention to the unit of each variable. Are there any missing values?

  • The data has 46 columns and 2431 individual observations in total. As shown in the summary output, there are no missing values. All of the variables are numeric except Gender and Job05, which are read in as factors.
data.esteem=read.csv("data/NLSY79.csv")
dim(data.esteem)
## [1] 2431   46
names(data.esteem)
##  [1] "Subject"        "Gender"         "Education05"    "Income87"      
##  [5] "Job05"          "Income05"       "Weight05"       "HeightFeet05"  
##  [9] "HeightInch05"   "Imagazine"      "Inewspaper"     "Ilibrary"      
## [13] "MotherEd"       "FatherEd"       "FamilyIncome78" "Science"       
## [17] "Arith"          "Word"           "Parag"          "Number"        
## [21] "Coding"         "Auto"           "Math"           "Mechanic"      
## [25] "Elec"           "AFQT"           "Esteem81_1"     "Esteem81_2"    
## [29] "Esteem81_3"     "Esteem81_4"     "Esteem81_5"     "Esteem81_6"    
## [33] "Esteem81_7"     "Esteem81_8"     "Esteem81_9"     "Esteem81_10"   
## [37] "Esteem87_1"     "Esteem87_2"     "Esteem87_3"     "Esteem87_4"    
## [41] "Esteem87_5"     "Esteem87_6"     "Esteem87_7"     "Esteem87_8"    
## [45] "Esteem87_9"     "Esteem87_10"
summary(data.esteem)
##     Subject         Gender      Education05      Income87    
##  Min.   :    2   female:1199   Min.   : 6.0   Min.   :   -2  
##  1st Qu.: 1592   male  :1232   1st Qu.:12.0   1st Qu.: 4500  
##  Median : 3137                 Median :13.0   Median :12000  
##  Mean   : 3504                 Mean   :13.9   Mean   :13399  
##  3rd Qu.: 4668                 3rd Qu.:16.0   3rd Qu.:19000  
##  Max.   :12140                 Max.   :20.0   Max.   :59387  
##                                                              
##                                                              Job05     
##  10 TO 430: Executive, Administrative and Managerial Occupations: 377  
##  5000 TO 5930: Office and Administrative Support Workers        : 360  
##  4700 TO 4960: Sales and Related Workers                        : 205  
##  6200 TO 6940: Construction Trade and Extraction Workers        : 135  
##  2200 TO 2340: Teachers                                         : 120  
##  9000 TO 9750: Transportation and Material Moving Workers       : 117  
##  (Other)                                                        :1117  
##     Income05         Weight05    HeightFeet05    HeightInch05     Imagazine    
##  Min.   :    63   Min.   : 81   Min.   :-4.00   Min.   : 0.00   Min.   :0.000  
##  1st Qu.: 22650   1st Qu.:150   1st Qu.: 5.00   1st Qu.: 2.00   1st Qu.:0.000  
##  Median : 38500   Median :180   Median : 5.00   Median : 5.00   Median :1.000  
##  Mean   : 49415   Mean   :183   Mean   : 5.18   Mean   : 5.32   Mean   :0.718  
##  3rd Qu.: 61350   3rd Qu.:209   3rd Qu.: 5.00   3rd Qu.: 8.00   3rd Qu.:1.000  
##  Max.   :703637   Max.   :380   Max.   : 8.00   Max.   :11.00   Max.   :1.000  
##                                                                                
##    Inewspaper       Ilibrary       MotherEd       FatherEd    FamilyIncome78 
##  Min.   :0.000   Min.   :0.00   Min.   : 0.0   Min.   : 0.0   Min.   :    0  
##  1st Qu.:1.000   1st Qu.:1.00   1st Qu.:11.0   1st Qu.:10.0   1st Qu.:11167  
##  Median :1.000   Median :1.00   Median :12.0   Median :12.0   Median :20000  
##  Mean   :0.861   Mean   :0.77   Mean   :11.7   Mean   :11.8   Mean   :21252  
##  3rd Qu.:1.000   3rd Qu.:1.00   3rd Qu.:12.0   3rd Qu.:14.0   3rd Qu.:27500  
##  Max.   :1.000   Max.   :1.00   Max.   :20.0   Max.   :20.0   Max.   :75001  
##                                                                              
##     Science         Arith           Word          Parag          Number    
##  Min.   : 0.0   Min.   : 0.0   Min.   : 0.0   Min.   : 0.0   Min.   : 0.0  
##  1st Qu.:13.0   1st Qu.:13.0   1st Qu.:23.0   1st Qu.:10.0   1st Qu.:29.0  
##  Median :17.0   Median :19.0   Median :28.0   Median :12.0   Median :36.0  
##  Mean   :16.3   Mean   :18.6   Mean   :26.6   Mean   :11.2   Mean   :35.5  
##  3rd Qu.:20.0   3rd Qu.:25.0   3rd Qu.:32.0   3rd Qu.:14.0   3rd Qu.:44.0  
##  Max.   :25.0   Max.   :30.0   Max.   :35.0   Max.   :15.0   Max.   :50.0  
##                                                                            
##      Coding          Auto           Math         Mechanic         Elec     
##  Min.   : 0.0   Min.   : 0.0   Min.   : 0.0   Min.   : 0.0   Min.   : 0.0  
##  1st Qu.:38.0   1st Qu.:10.0   1st Qu.: 9.0   1st Qu.:11.0   1st Qu.: 9.0  
##  Median :48.0   Median :14.0   Median :14.0   Median :14.0   Median :12.0  
##  Mean   :47.1   Mean   :14.3   Mean   :14.3   Mean   :14.4   Mean   :11.6  
##  3rd Qu.:57.0   3rd Qu.:18.0   3rd Qu.:20.0   3rd Qu.:18.0   3rd Qu.:15.0  
##  Max.   :84.0   Max.   :25.0   Max.   :25.0   Max.   :25.0   Max.   :20.0  
##                                                                            
##       AFQT         Esteem81_1     Esteem81_2     Esteem81_3     Esteem81_4  
##  Min.   :  0.0   Min.   :1.00   Min.   :1.00   Min.   :1.00   Min.   :1.00  
##  1st Qu.: 31.9   1st Qu.:1.00   1st Qu.:1.00   1st Qu.:3.00   1st Qu.:1.00  
##  Median : 57.0   Median :1.00   Median :1.00   Median :4.00   Median :2.00  
##  Mean   : 54.7   Mean   :1.42   Mean   :1.42   Mean   :3.51   Mean   :1.57  
##  3rd Qu.: 78.2   3rd Qu.:2.00   3rd Qu.:2.00   3rd Qu.:4.00   3rd Qu.:2.00  
##  Max.   :100.0   Max.   :4.00   Max.   :4.00   Max.   :4.00   Max.   :4.00  
##                                                                             
##    Esteem81_5     Esteem81_6     Esteem81_7     Esteem81_8     Esteem81_9  
##  Min.   :1.00   Min.   :1.00   Min.   :1.00   Min.   :1.00   Min.   :1.00  
##  1st Qu.:3.00   1st Qu.:1.00   1st Qu.:1.00   1st Qu.:3.00   1st Qu.:3.00  
##  Median :4.00   Median :2.00   Median :2.00   Median :3.00   Median :3.00  
##  Mean   :3.46   Mean   :1.62   Mean   :1.75   Mean   :3.13   Mean   :3.16  
##  3rd Qu.:4.00   3rd Qu.:2.00   3rd Qu.:2.00   3rd Qu.:4.00   3rd Qu.:4.00  
##  Max.   :4.00   Max.   :4.00   Max.   :4.00   Max.   :4.00   Max.   :4.00  
##                                                                            
##   Esteem81_10    Esteem87_1     Esteem87_2    Esteem87_3     Esteem87_4 
##  Min.   :1.0   Min.   :1.00   Min.   :1.0   Min.   :1.00   Min.   :1.0  
##  1st Qu.:3.0   1st Qu.:1.00   1st Qu.:1.0   1st Qu.:3.00   1st Qu.:1.0  
##  Median :3.0   Median :1.00   Median :1.0   Median :4.00   Median :1.0  
##  Mean   :3.4   Mean   :1.38   Mean   :1.4   Mean   :3.58   Mean   :1.5  
##  3rd Qu.:4.0   3rd Qu.:2.00   3rd Qu.:2.0   3rd Qu.:4.00   3rd Qu.:2.0  
##  Max.   :4.0   Max.   :4.00   Max.   :4.0   Max.   :4.00   Max.   :4.0  
##                                                                         
##    Esteem87_5     Esteem87_6     Esteem87_7     Esteem87_8    Esteem87_9  
##  Min.   :1.00   Min.   :1.00   Min.   :1.00   Min.   :1.0   Min.   :1.00  
##  1st Qu.:3.00   1st Qu.:1.00   1st Qu.:1.00   1st Qu.:3.0   1st Qu.:3.00  
##  Median :4.00   Median :2.00   Median :2.00   Median :3.0   Median :3.00  
##  Mean   :3.53   Mean   :1.59   Mean   :1.72   Mean   :3.1   Mean   :3.06  
##  3rd Qu.:4.00   3rd Qu.:2.00   3rd Qu.:2.00   3rd Qu.:4.0   3rd Qu.:4.00  
##  Max.   :4.00   Max.   :4.00   Max.   :4.00   Max.   :4.0   Max.   :4.00  
##                                                                           
##   Esteem87_10  
##  Min.   :1.00  
##  1st Qu.:3.00  
##  Median :3.00  
##  Mean   :3.37  
##  3rd Qu.:4.00  
##  Max.   :4.00  
## 

1.2 Self esteem evaluation

Let’s concentrate on the esteem scores evaluated in 1987.

  1. Reverse Esteem 1, 2, 4, 6, and 7 so that a higher score corresponds to higher self-esteem. (Hint: if we store the esteem data in data.esteem, then data.esteem[, c(1, 2, 4, 6, 7)] <- 5 - data.esteem[, c(1, 2, 4, 6, 7)] to reverse the score.)
data.esteem[, c(37, 38, 40, 42, 43)] <- 5 - data.esteem[, c(37, 38, 40, 42, 43)] # Esteem87_1, _2, _4, _6, _7
  2. Write a brief summary with necessary plots about the 10 esteem measurements.

The average 1987 esteem scores (after reversal) are roughly between 3.0 and 3.6.

barplot(colMeans(data.esteem[, 37:46]), main="Esteem87 Measurements", ylab="Average Score")

The distributions of the esteem scores are fairly consistent between 1981 and 1987 (bearing in mind that the 1981 scores in this plot are still on the original coding, while the 1987 scores were reversed above); Esteem_4 shows the largest shift, from a majority score of 3 to a majority score of 4.

library(tidyr)
library(ggplot2)
graphable_esteem <- data.esteem[,27:46]
ggplot(gather(graphable_esteem), aes(value)) + 
    geom_histogram(bins = 10) + 
    facet_wrap(~key, scales = 'free_x')

3. Are the esteem scores all positively correlated? Report the pairwise correlation table and write a brief summary.

The correlation coefficient measures the strength of the linear relationship between two variables. All pairwise correlations among the esteem scores are positive and fairly strong. For example, question 1 (“I am a person of worth”) and question 2 (“I have a number of good qualities”) are highly correlated, with a correlation coefficient of 0.70.

res2 <- cor(data.esteem[,c(37, 38, 39, 40, 41, 42, 43, 44, 45, 46)])
res2
##             Esteem87_1 Esteem87_2 Esteem87_3 Esteem87_4 Esteem87_5 Esteem87_6
## Esteem87_1       1.000      0.704      0.448      0.528      0.399      0.464
## Esteem87_2       0.704      1.000      0.443      0.551      0.402      0.481
## Esteem87_3       0.448      0.443      1.000      0.408      0.549      0.410
## Esteem87_4       0.528      0.551      0.408      1.000      0.381      0.509
## Esteem87_5       0.399      0.402      0.549      0.381      1.000      0.405
## Esteem87_6       0.464      0.481      0.410      0.509      0.405      1.000
## Esteem87_7       0.379      0.410      0.343      0.422      0.370      0.600
## Esteem87_8       0.273      0.283      0.351      0.295      0.381      0.409
## Esteem87_9       0.236      0.259      0.349      0.287      0.354      0.364
## Esteem87_10      0.312      0.330      0.460      0.366      0.436      0.442
##             Esteem87_7 Esteem87_8 Esteem87_9 Esteem87_10
## Esteem87_1       0.379      0.273      0.236       0.312
## Esteem87_2       0.410      0.283      0.259       0.330
## Esteem87_3       0.343      0.351      0.349       0.460
## Esteem87_4       0.422      0.295      0.287       0.366
## Esteem87_5       0.370      0.381      0.354       0.436
## Esteem87_6       0.600      0.409      0.364       0.442
## Esteem87_7       1.000      0.389      0.352       0.390
## Esteem87_8       0.389      1.000      0.430       0.438
## Esteem87_9       0.352      0.430      1.000       0.579
## Esteem87_10      0.390      0.438      0.579       1.000
  4. PCA on 10 esteem measurements. (centered but no scaling)

a) Report the PC1 and PC2 loadings. Are they unit vectors? Are they orthogonal?

As the output below shows, the loadings are indeed orthogonal unit vectors: the columns of the rotation matrix are mutually orthogonal, and the sum of squared PC1 loadings equals 1. (Note: although the prompt asks for centering only, the code below both centers and scales the data.)

# Center (and also scale) the columns; prcomp can do the same via its
# center and scale. arguments
library(dplyr)
center_scale <- function(x) {
    scale(x, center = TRUE, scale = TRUE)
}
data.esteem.centered <- as.data.frame(center_scale(data.esteem[, 37:46]))
pc.2 <- prcomp(data.esteem.centered)
names(pc.2)
## [1] "sdev"     "rotation" "center"   "scale"    "x"
pc.2$rotation[, 1:2]
##               PC1     PC2
## Esteem87_1  0.324 -0.4452
## Esteem87_2  0.333 -0.4283
## Esteem87_3  0.322  0.0115
## Esteem87_4  0.324 -0.2877
## Esteem87_5  0.315  0.0793
## Esteem87_6  0.347 -0.0492
## Esteem87_7  0.315  0.0196
## Esteem87_8  0.280  0.3619
## Esteem87_9  0.277  0.4917
## Esteem87_10 0.318  0.3918
# Sum of squared PC1 loadings: should equal 1 for a unit vector
sum(pc.2$rotation[, "PC1"]^2)
## [1] 1
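Both properties can also be checked in one step: for any prcomp fit, the rotation matrix has orthonormal columns. A self-contained sketch on built-in data:

```r
p <- prcomp(USArrests, scale. = TRUE)
V <- p$rotation
# Orthonormal columns: t(V) %*% V equals the identity matrix
max(abs(crossprod(V) - diag(ncol(V))))  # essentially zero
```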

The PC1 loadings are all positive. This makes sense because we flipped the values of questions 1, 2, 4, 6, and 7 so that every question points in the same direction (higher = higher self-esteem).

b) Are there good interpretations for PC1 and PC2? (If loadings are all negative, take the positive loadings for the ease of interpretation)

The PC1 loadings are all positive, so PC1 represents the overall esteem of a respondent, i.e., their general view of themselves. PC2 has negative loadings for questions 1, 2, and 4 (and a slightly negative one for question 6), the questions we flipped earlier, so it contrasts those items against questions 8, 9, and 10.

c) How is the PC1 score obtained for each subject? Write down the formula.

PC1 = 0.324·Esteem87_1 + 0.333·Esteem87_2 + 0.322·Esteem87_3 + 0.324·Esteem87_4 + 0.315·Esteem87_5 + 0.347·Esteem87_6 + 0.315·Esteem87_7 + 0.280·Esteem87_8 + 0.277·Esteem87_9 + 0.318·Esteem87_10, where each Esteem87_j is the centered (and here scaled) score.
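In matrix form, each subject’s PC1 score is the centered (and here scaled) data row multiplied by the PC1 loading vector. A self-contained check on built-in data that this formula reproduces prcomp’s scores:

```r
p <- prcomp(USArrests, center = TRUE, scale. = TRUE)
X <- scale(USArrests)                # center and scale, exactly as prcomp did
manual_pc1 <- X %*% p$rotation[, 1]  # apply the PC1 loading vector
max(abs(manual_pc1 - p$x[, 1]))      # essentially zero: formula matches prcomp
```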

d) Are PC1 scores and PC2 scores in the data uncorrelated? 
# Correlate the PC *scores* (pc.2$x), not the columns of the loading matrix
cor(pc.2$x[, 1], pc.2$x[, 2])  # essentially zero

Yes, the PC scores are uncorrelated: principal component scores are orthogonal by construction. (Note that uncorrelated does not in general imply independent.)

e) Plot PVE (Proportion of Variance Explained) and summarize the plot. 
summary(pc.2)$importance
##                          PC1   PC2   PC3    PC4    PC5    PC6    PC7    PC8
## Standard deviation     2.166 1.102 0.911 0.8090 0.7699 0.7037 0.6714 0.6346
## Proportion of Variance 0.469 0.121 0.083 0.0654 0.0593 0.0495 0.0451 0.0403
## Cumulative Proportion  0.469 0.590 0.673 0.7388 0.7981 0.8477 0.8927 0.9330
##                           PC9   PC10
## Standard deviation     0.6133 0.5421
## Proportion of Variance 0.0376 0.0294
## Cumulative Proportion  0.9706 1.0000
#Scree plot of variances 
plot(pc.2)

#Scree plot of PVE's
plot(summary(pc.2)$importance[2,],
     ylab="PVE",
     xlab="Number of PC's",
     pch=16,
     main="Scree Plot of PVE for esteem")

The plot shows that PC1 alone accounts for 46.9% of the variance, PC2 another 12.1%, and PC3 onward each account for under 10% of the variability.

f) Also plot CPVE (Cumulative Proportion of Variance Explained). What proportion of the variance in the data is explained by the first two principal components?
#Plot of cumulative PVE's
plot(summary(pc.2)$importance[3,],
     ylab="Cumulative PVE",
     xlab="Number of PC's",
     pch=16,
     main="Scree Plot of Cumulative PVE for esteem")

Around 59% of the variance in the data is explained by the first two principal components (cumulative PVE = 0.590).
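For reference, PVE and cumulative PVE can be computed directly from the standard deviations that prcomp returns (a sketch on built-in data):

```r
p <- prcomp(USArrests, scale. = TRUE)
pve <- p$sdev^2 / sum(p$sdev^2)  # proportion of variance explained per PC
cumsum(pve)                      # cumulative PVE; the last entry is exactly 1
```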

g) PC’s provide us with a low dimensional view of the self-esteem scores. Use a biplot with the first two PC's to display the data.  Give an interpretation of PC1 and PC2 from the plot. (try `ggbiplot` if you could, much prettier!)
#Biplot of the first two PC's
lim <- c(-.1, 0.1)
biplot(pc.2, xlim=lim, ylim=lim, main = "Biplot of the PC's")
abline(v=0, h=0)

PC1 is roughly the overall sum of the esteem items, while PC2 is a contrast between questions 8, 9, and 10 (positive loadings) and questions 1, 2, and 4, and weakly 6 (negative loadings).

  5. Apply k-means to cluster subjects on the original esteem scores

a) Find a reasonable number of clusters using the within-cluster sum of squares and the elbow rule.

According to the elbow rule, 3 clusters appears to be a reasonable choice.

# library("car")
# library("factoextra")
# fviz_nbclust(data.esteem[, c(37, 38, 39, 40, 41, 42, 43, 44, 45, 46)],kmeans, method="wss")
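If factoextra is not available, the same elbow plot can be produced in base R by computing the total within-cluster sum of squares over a range of k (shown on built-in data for a self-contained sketch):

```r
set.seed(1)  # k-means uses random starts
wss <- sapply(1:8, function(k)
  kmeans(scale(USArrests), centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b", pch = 16,
     xlab = "Number of clusters k", ylab = "Total within-cluster SS",
     main = "Elbow plot")
```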
b) Can you summarize common features within each cluster?
# k-means starts from random centers; call set.seed() first for reproducibility
esteem.kmeans <- kmeans(data.esteem[, 37:46], centers = 3)
str(esteem.kmeans)
## List of 9
##  $ cluster     : int [1:2431] 2 1 2 3 1 1 1 2 1 1 ...
##  $ centers     : num [1:3, 1:10] 3.91 3.29 3.51 3.9 3.25 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "1" "2" "3"
##   .. ..$ : chr [1:10] "Esteem87_1" "Esteem87_2" "Esteem87_3" "Esteem87_4" ...
##  $ totss       : num 8775
##  $ withinss    : num [1:3] 2241 1523 1597
##  $ tot.withinss: num 5361
##  $ betweenss   : num 3414
##  $ size        : int [1:3] 1125 800 506
##  $ iter        : int 3
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
esteem.kmeans$cluster[1:20] # first 20 cluster labels, one per subject (k = 3)
##  [1] 2 1 2 3 1 1 1 2 1 1 1 2 1 1 1 3 1 3 2 2

(The full vector of 2431 labels is omitted.)
esteem.kmeans$size # size of each cluster
## [1] 1125  800  506
Income.clustered <- data.esteem %>% select (Income05) %>%
        mutate(group = esteem.kmeans$cluster) %>%
        arrange(desc(group))
ggplot(Income.clustered, aes(x = Income05, fill = as.factor(group)))+
  geom_histogram(binwidth = 50000, position = "dodge") +
  theme_classic()

AFQT.clustered <- data.esteem %>% select (AFQT) %>%
        mutate(group = esteem.kmeans$cluster) %>%
        arrange(desc(group))
ggplot(AFQT.clustered, aes(x = AFQT, fill = as.factor(group)))+
  geom_histogram(binwidth = 5, position = "dodge") + # AFQT is a 0-100 percentile score
  theme_classic()

c1 <- data.esteem %>%
  mutate(group = esteem.kmeans$cluster)%>%
  filter(group == 1)
ggplot(c1, aes(x = Income05)) +
  geom_histogram(binwidth = 50000)

c2 <- data.esteem %>%
  mutate(group = esteem.kmeans$cluster)%>%
  filter(group == 2)
ggplot(c2, aes(x = Income05)) +
  geom_histogram(binwidth = 50000)

c3 <- data.esteem %>%
  mutate(group = esteem.kmeans$cluster)%>%
  filter(group == 3)
ggplot(c3, aes(x = Income05)) +
  geom_histogram(binwidth = 50000)

The sizes of the three clusters turn out to be 1125, 800, and 506, respectively (note that k-means labels and sizes vary between runs unless a seed is set). The three clusters have similar spreads in terms of income, with cluster 3 slightly more left-skewed than clusters 2 and 1. As for AFQT, cluster 1 is strongly left-skewed with most individuals scoring high, cluster 3 is moderately left-skewed, while cluster 2 is flatter and right-skewed, with more people scoring lower.
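A compact alternative for summarizing common features is to average every variable within each cluster, e.g. with aggregate (a sketch on built-in data):

```r
set.seed(1)
km <- kmeans(scale(USArrests), centers = 3, nstart = 25)
# Mean of each original variable within each cluster
aggregate(USArrests, by = list(cluster = km$cluster), FUN = mean)
```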

c) Can you visualize the clusters with somewhat clear boundaries? You may try different pairs of variables and different PC pairs of the esteem scores.
esteem.pca <- prcomp(data.esteem[,c(37, 38, 39, 40, 41, 42, 43, 44, 45, 46)],center= TRUE, scale = TRUE)
library("ggplot2")
esteem.final1 <- data.frame(pc1 = esteem.pca$x[,1], pc2 = esteem.pca$x[,2],
           group = as.factor(esteem.kmeans$cluster))
ggplot(data = esteem.final1, aes(x = pc1, y = pc2, col=group)) + geom_point() + ggtitle("Clustering over PC1 and PC2")

esteem.final2 <- data.frame(Income = data.esteem["FamilyIncome78"], AFQT = data.esteem["AFQT"], group = as.factor(esteem.kmeans$cluster))
ggplot(data = esteem.final2, aes(x = FamilyIncome78, y = AFQT, col=group)) + geom_point() + ggtitle("Clustering over Income and AFQT Score")

esteem.final3 <- data.frame(Education = data.esteem["Education05"], Weight = data.esteem["Weight05"], group = as.factor(esteem.kmeans$cluster))
ggplot(data = esteem.final3, aes(x = Education05, y = Weight05, col=group)) + geom_point() + ggtitle("Clustering over Education and Weight")

As shown in the plots above, total esteem (PC1) is clearly differentiated across the clusters: cluster 1 has the highest esteem, followed by cluster 3 and then cluster 2. However, there is no clear separation among the three clusters in terms of variables such as weight, education, and income.

  6. We now try to find out which factors are related to self-esteem. PC1 of all the esteem scores is a good variable to summarize one’s esteem, so we take PC1 as our response variable.

a) Prepare possible factors/variables:
    • Personal information: gender, education (measured in 2005, which is problematic since it postdates the 1987 esteem scores), log(income) in 87, job type in 87, and body mass index as a measure of health (BMI is defined as body mass divided by the square of body height, universally expressed in kg/m²). Since BMI is measured in 05, it is also questionable to include it as a possible variable.

    • Household environment: Imagazine, Inewspaper, Ilibrary, MotherEd, FatherEd, FamilyIncome78. Do set indicators Imagazine, Inewspaper and Ilibrary as factors.

    • Use PC1 of the ASVAB test scores as a measure of intelligence
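A sketch of the variable construction, using toy rows with the NLSY79 column names above (the unit conversions are standard; the data values are illustrative):

```r
# Toy rows reusing the NLSY79 column names (values illustrative)
d <- data.frame(HeightFeet05 = c(5, 6), HeightInch05 = c(10, 1),
                Weight05 = c(170, 200), Income87 = c(16000, -1))
height_m  <- (d$HeightFeet05 * 12 + d$HeightInch05) * 0.0254  # inches to meters
weight_kg <- d$Weight05 * 0.45359237                          # pounds to kg
d$BMI <- weight_kg / height_m^2                               # BMI in kg/m^2
# Nonpositive incomes (NLSY nonresponse codes) cannot be logged; set them to NA
d$logIncome87 <- ifelse(d$Income87 > 0, log(d$Income87), NA)
round(d$BMI, 1)  # about 24.4 and 26.4
```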

data.esteem %>%
  select(Income87) %>%
  head()
##   Income87
## 1    16000
## 2    18000
## 3        0
## 4     9000
## 5    15000
## 6     2200

Only the first rows are shown; printing the full column is not informative. Note the zeros and the negative values (e.g. -1, -2), which NLSY uses as nonresponse codes and which must be handled before taking log(income).
## 180     36000
## 181      6000
## 182      7000
## 183     25000
## 184     27000
## 185      8000
## 186      2000
## 187         0
## 188      8400
## 189      8300
## 190     12000
## 191     32000
## 192         0
## 193     10300
## 194     18000
## 195     28500
## 196     16500
## 197     15500
## 198     12000
## 199     16000
## 200     59387
## 201     11000
## 202      5000
## 203         0
## 204     17000
## 205         0
## 206     59387
## 207      4000
## 208     14000
## 209     20000
## 210     13650
## 211      8000
## 212     13300
## 213     15000
## 214     28450
## 215       800
## 216     25000
## 217      9900
## 218      4000
## 219     18000
## 220         0
## 221      3500
## 222     16000
## 223     16500
## 224     10000
## 225         0
## 226         0
## 227     15500
## 228     21000
## 229     11400
## 230     32000
## 231      2570
## 232     22000
## 233     30000
## 234     26000
## 235     27000
## 236     21000
## 237     12000
## 238      5000
## 239     25000
## 240     10000
## 241     16004
## 242     26000
## 243     12500
## 244       975
## 245      7000
## 246      1100
## 247     13000
## 248     32000
## 249      7700
## 250         0
## 251     30000
## 252      3000
## 253        -2
## 254     30000
## 255     26000
## 256      1496
## 257     18000
## 258      8000
## 259      1300
## 260      5000
## 261     10400
## 262         0
## 263      8000
## 264      5000
## 265     18500
## 266      7000
## 267      1592
## 268     12000
## 269     13500
## 270     59387
## 271     26000
## 272      1500
## 273     14500
## 274     15000
## 275      9000
## 276      7500
## 277     10350
## 278     35000
## 279     27000
## 280     10000
## 281         0
## 282         0
## 283       500
## 284      8000
## 285      1500
## 286         0
## 287      9600
## 288     21000
## 289      4000
## 290     31000
## 291         0
## 292         0
## 293     15000
## 294     12000
## 295     26000
## 296       700
## 297     20000
## 298     28000
## 299     59387
## 300      3500
## 301     19500
## 302       250
## 303         0
## 304       920
## 305     10000
## 306     20500
## 307     31000
## 308     10500
## 309      4000
## 310      8000
## 311         0
## 312     13000
## 313     10000
## 314     10000
## 315      9000
## 316      7000
## 317     36000
## 318     35000
## 319     21000
## 320      3000
## 321      4000
## 322     15000
## 323      3000
## 324     26000
## 325      6000
## 326     24800
## 327         0
## 328      3100
## 329     16000
## 330         0
## 331      9300
## 332         0
## 333     32000
## 334         0
## 335      7300
## 336      6000
## 337     18343
## 338      6400
## 339     30000
## 340      4600
## 341      7000
## 342     20000
## 343         0
## 344     36000
## 345     33000
## 346     16000
## 347      7000
## 348     18000
## 349     14000
## 350      3000
## 351         0
## 352     11000
## 353      7000
## 354      5000
## 355     24000
## 356     15000
## 357     25000
## 358     22000
## 359     18000
## 360     59387
## 361     14600
## 362      6000
## 363         0
## 364     15000
## 365     15000
## 366     21000
## 367        -2
## 368     21000
## 369     11000
## 370     21100
## 371     34000
## 372     13000
## 373     14000
## 374     59387
## 375     15000
## 376     28000
## 377     14000
## 378     30000
## 379     20000
## 380     59387
## 381     18000
## 382     20000
## 383     25000
## 384     12000
## 385     20000
## 386     30000
## 387     15000
## 388     22000
## 389     15400
## 390      7000
## 391     17000
## 392     16700
## 393     15000
## 394     17000
## 395     13000
## 396      9000
## 397     15000
## 398      3500
## 399     20000
## 400     59387
## 401     14600
## 402         0
## 403      2000
## 404     25000
## 405      3250
## 406      4992
## 407     13000
## 408      6000
## 409         0
## 410      1200
## 411      4000
## 412      1000
## 413      9000
## 414     18540
## 415     14000
## 416      9000
## 417      5000
## 418     10000
## 419     17000
## 420     10000
## 421         0
## 422         0
## 423     21000
## 424     12000
## 425     20800
## 426     14000
## 427     17000
## 428     59387
## 429     17000
## 430     10000
## 431      9000
## 432      1800
## 433     12500
## 434      6000
## 435     21000
## 436     15000
## 437     10000
## 438     32700
## 439         0
## 440     59387
## 441     27000
## 442      2114
## 443     18000
## 444     10000
## 445     13000
## 446     16000
## 447     10000
## 448     22000
## 449     16800
## 450      3000
## 451         0
## 452     21000
## 453     16000
## 454     35000
## 455       400
## 456     12000
## 457      4800
## 458      2600
## 459     24500
## 460     14000
## 461     17000
## 462     20000
## 463     15000
## 464     13000
## 465      2880
## 466      3000
## 467     20000
## 468      5700
## 469     11500
## 470     16000
## 471     38000
## 472     15000
## 473     17000
## 474      9000
## 475     27000
## 476      9000
## 477     13000
## 478      9500
## 479      2121
## 480     12300
## 481         0
## 482      6000
## 483         0
## 484     14000
## 485      3500
## 486      9100
## 487     36000
## 488     10000
## 489     28160
## 490        -1
## 491      1500
## 492      2000
## 493     25000
## 494     11000
## 495         0
## 496     18553
## 497      6000
## 498     10500
## 499     10500
## 500      7000
## 501     16000
## 502     12000
## 503      2200
## 504     15000
## 505     16900
## 506     19960
## 507     15000
## 508     59387
## 509       700
## 510      7000
## 511     12500
## 512     19000
## 513     21000
## 514     19000
## 515     18200
## 516     11800
## 517         0
## 518     27000
## 519     20000
## 520     10000
## 521     14000
## 522     21000
## 523      9000
## 524     25200
## 525     20000
## 526     20000
## 527     36000
## 528     15000
## 529      6000
## 530     13000
## 531         0
## 532     10000
## 533      6000
## 534      3000
## 535     19000
## 536     17000
## 537         0
## 538      7000
## 539      1300
## 540     23000
## 541      8000
## 542     20500
## 543     20000
## 544     19000
## 545     11500
## 546     21000
## 547         0
## 548     10515
## 549         0
## 550     11700
## 551        -1
## 552     30000
## 553     21000
## 554     13000
## 555     19000
## 556     29000
## 557     30000
## 558     13000
## 559      6000
## 560      8685
## 561     15000
## 562        -2
## 563      1000
## 564      7400
## 565         0
## 566     11000
## 567      3100
## 568      9000
## 569      8000
## 570     21000
## 571     12000
## 572      5000
## 573     19000
## 574     33000
## 575     19000
## 576     19526
## 577         0
## 578     14500
## 579     10000
## 580      2000
## 581     13000
## 582     10500
## 583     20000
## 584     11000
## 585         0
## 586     16000
## 587     14000
## 588     13000
## 589     11000
## 590      9000
## 591     21000
## 592     10046
## 593     16000
## 594      6500
## 595      7000
## 596     12000
## 597       360
## 598     14300
## 599     10000
## 600     10000
## 601     35000
## 602     25000
## 603     24000
## 604     18000
## 605      7000
## 606      8300
## 607      3500
## 608     10000
## 609         0
## 610     33600
## 611     11000
## 612      8040
## 613     22000
## 614     15500
## 615     20000
## 616      9000
## 617      9000
## 618      2600
## 619      6500
## 620     15000
## 621     32000
## 622         0
## 623     14000
## 624      3500
## 625      9000
## 626     19000
## 627         0
## 628     10000
## 629      2200
## 630     14600
## 631     15000
## 632     16300
## 633     16000
## 634         0
## 635     18000
## 636     35000
## 637      3300
## 638     17000
## 639     24000
## 640     20000
## 641     30000
## 642     14000
## 643     16472
## 644     34000
## 645      1800
## 646      1000
## 647     12000
## 648     23000
## 649     24000
## 650      7000
## 651      6000
## 652      1800
## 653      3471
## 654      6000
## 655     13000
## 656       200
## 657         0
## 658      2000
## 659      7600
## 660      8000
## 661     20000
## 662      1100
## 663         0
## 664         0
## 665     13000
## 666     10000
## 667     12000
## 668     11000
## 669         0
## 670     18000
## 671     10000
## 672     20000
## 673     11000
## 674     25000
## 675     14000
## 676     14800
## 677      2000
## 678     21829
## 679      5902
## 680     20000
## 681      8000
## 682     13000
## 683      7000
## 684         0
## 685     22000
## 686         0
## 687     31000
## 688     12000
## 689     15000
## 690     20000
## 691      5700
## 692      8400
## 693     11000
## 694     12000
## 695      5600
## 696     14900
## 697         0
## 698      3000
## 699     13000
## 700      1000
## 701     31000
## 702         0
## 703     18000
## 704     14000
## 705      3000
## 706     16900
## 707      8100
## 708         0
## 709     11000
## 710     26500
## 711     13500
## 712     14000
## 713         0
## 714     21000
## 715     38000
## 716         0
## 717        -2
## 718      4375
## 719      5000
## 720     15000
## 721     18800
## 722       332
## 723     31000
## 724     24500
## 725         0
## 726         0
## 727     15000
## 728     23000
## 729         0
## 730     21222
## 731     10000
## 732      1600
## 733         0
## 734     18000
## 735     25000
## 736     17000
## 737     10000
## 738     23000
## 739     25000
## 740         0
## 741     14000
## 742      6000
## 743     20000
## 744     12000
## 745     14076
## 746         0
## 747         0
## 748     15400
## 749     16000
## 750     11700
## 751      5200
## 752         0
## 753     18000
## 754      7800
## 755        -1
## 756         0
## 757      3500
## 758         0
## 759      3000
## 760     32000
## 761     32000
## 762     14000
## 763      5866
## 764      2500
## 765     14000
## 766     19000
## 767      8000
## 768      4500
## 769         0
## 770      9543
## 771     11559
## 772      9000
## 773     10000
## 774      7600
## 775         0
## 776      8000
## 777     17000
## 778     30149
## 779     11000
## 780      8000
## 781      8500
## 782     20000
## 783     10400
## 784     11000
## 785     16000
## 786     25000
## 787       200
## 788     10000
## 789     18000
## 790      4654
## 791     22000
## 792     17000
## 793     17300
## 794     10000
## 795       788
## 796       600
## 797     11000
## 798     33000
## 799     22000
## 800      4000
## 801      2400
## 802     24000
## 803     14000
## 804     15000
## 805     20101
## 806     20000
## 807     15000
## 808     11000
## 809     12000
## 810      6000
## 811     22500
## 812     24000
## 813     15000
## 814     13600
## 815     59387
## 816      7000
## 817      4900
## 818     19500
## 819     24500
## 820       900
## 821      6000
## 822         0
## 823      4000
## 824     14000
## 825      4000
## 826     19000
## 827     17000
## 828     14000
## 829         0
## 830     17000
## 831     12000
## 832     20000
## 833       900
## 834         0
## 835     15000
## 836      8000
## 837      5200
## 838       500
## 839      8500
## 840         0
## 841      8800
## 842     11000
## 843      5550
## 844     35000
## 845     35000
## 846     17000
## 847       400
## 848      5000
## 849     21500
## 850         0
## 851      7000
## 852     13500
## 853         0
## 854     20000
## 855     21000
## 856     25000
## 857      9500
## 858        -1
## 859     13000
## 860      5600
## 861     10600
## 862     17300
## 863     14000
## 864      3000
## 865     10000
## 866      4500
## 867     10000
## 868       200
## 869     28900
## 870      8000
## 871     59387
## 872      3700
## 873     22000
## 874     11000
## 875     12000
## 876     14000
## 877      1300
## 878     12500
## 879     20000
## 880     15000
## 881     20000
## 882      2000
## 883     11000
## 884     10000
## 885     10000
## 886         0
## 887     16200
## 888         0
## 889      5000
## 890     14000
## 891     13000
## 892         0
## 893      8000
## 894     20000
## 895      5000
## 896      6000
## 897      4000
## 898      1150
## 899     29900
## 900     15000
## 901      5000
## 902     25000
## 903     25000
## 904     25500
## 905     13000
## 906         0
## 907     17000
## 908     18000
## 909     10000
## 910     32000
## 911     14000
## 912     59387
## 913     26810
## 914     25500
## 915     18000
## 916        -2
## 917         0
## 918     59387
## 919      8000
## 920     15000
## 921       660
## 922     15000
## 923      1844
## 924     32000
## 925     13600
## 926         0
## 927         0
## 928         0
## 929     22000
## 930      1200
## 931         0
## 932     15000
## 933      6120
## 934     28000
## 935     38000
## 936     28700
## 937     13000
## 938     12000
## 939         0
## 940       100
## 941         0
## 942     18000
## 943     15000
## 944     26000
## 945     25000
## 946     15000
## 947     14000
## 948         0
## 949     12000
## 950      5450
## 951      4000
## 952     29000
## 953     11000
## 954         0
## 955       320
## 956     10000
## 957         0
## 958     13500
## 959     16000
## 960     25000
## 961     12000
## 962     13000
## 963     26000
## 964     14000
## 965     10000
## 966      1140
## 967      2000
## 968      5200
## 969         0
## 970      7000
## 971     23000
## 972         0
## 973     17000
## 974     12000
## 975         0
## 976       889
## 977      2800
## 978      5000
## 979      9000
## 980     11000
## 981     13000
## 982     13500
## 983     15000
## 984     25600
## 985     10000
## 986     15000
## 987     23660
## 988     14044
## 989      4526
## 990         0
## 991      5000
## 992     14000
## 993         0
## 994         0
## 995      7000
## 996     12317
## 997     20000
## 998     25000
## 999      7500
## 1000    10500
## 1001     4930
## 1002    22000
## 1003    11000
## 1004        0
## 1005    22000
## 1006     7600
## 1007    12000
## 1008     6800
## 1009     3000
## 1010    59387
## 1011    14500
## 1012    18000
## 1013    19000
## 1014     3500
## 1015     2000
## 1016        0
## 1017        0
## 1018        0
## 1019    25000
## 1020        0
## 1021     8000
## 1022    14698
## 1023    17000
## 1024    16000
## 1025    18000
## 1026    59387
## 1027     9000
## 1028    16000
## 1029    15000
## 1030     2000
## 1031     8000
## 1032     6800
## 1033    14000
## 1034     8000
## 1035        0
## 1036        0
## 1037        0
## 1038     6100
## 1039     9000
## 1040     8756
## 1041     1500
## 1042     5600
## 1043     9000
## 1044     3000
## 1045    28000
## 1046        0
## 1047        0
## 1048    11932
## 1049     1977
## 1050    14000
## 1051    36500
## 1052    11500
## 1053    13000
## 1054    14000
## 1055    15000
## 1056    12000
## 1057    12000
## 1058        0
## 1059     4000
## 1060    18500
## 1061    59387
## 1062      200
## 1063      600
## 1064    18000
## 1065     8300
## 1066        0
## 1067     9000
## 1068    15108
## 1069    13000
## 1070     9000
## 1071    11000
## 1072    23000
## 1073    13200
## 1074     8000
## 1075    15000
## 1076     6800
## 1077    15000
## 1078    16000
## 1079     9800
## 1080        0
## 1081    17000
## 1082    11000
## 1083    10255
## 1084     9000
## 1085    13000
## 1086        0
## 1087    10000
## 1088    24000
## 1089    10000
## 1090    24000
## 1091    12000
## 1092    10000
## 1093      160
## 1094    21000
## 1095    12000
## 1096      895
## 1097    10500
## 1098    11000
## 1099    21000
## 1100    18000
## 1101    24000
## 1102    11000
## 1103     9000
## 1104     9500
## 1105     4100
## 1106    11400
## 1107    22000
## 1108    17500
## 1109    17000
## 1110    59387
## 1111    18000
## 1112     2800
## 1113    21000
## 1114    23000
## 1115    26000
## 1116    32000
## 1117    25500
## 1118    28000
## 1119    15000
## 1120     7400
## 1121    35000
## 1122    13000
## 1123    32000
## 1124    21000
## 1125    14000
## 1126    59387
## 1127    30000
## 1128     3000
## 1129    59387
## 1130        0
## 1131    13000
## 1132    23000
## 1133    31000
## 1134     2000
## 1135     2000
## 1136    16000
## 1137    18500
## 1138    17000
## 1139    23000
## 1140    59387
## 1141    17000
## 1142    20600
## 1143    25500
## 1144     7000
## 1145     2880
## 1146    19000
## 1147    19000
## 1148        0
## 1149    11000
## 1150    15000
## 1151    28000
## 1152     7000
## 1153     1800
## 1154    25000
## 1155    33200
## 1156        0
## 1157    23000
## 1158    13000
## 1159    24000
## 1160        0
## 1161        0
## 1162    25500
## 1163    15100
## 1164    35000
## 1165     1600
## 1166    29000
## 1167    18000
## 1168    59387
## 1169        0
## 1170    19000
## 1171    22000
## 1172    17000
## 1173    25000
## 1174    25000
## 1175    26000
## 1176    15000
## 1177    25000
## 1178     1200
## 1179    11000
## 1180    39000
## 1181    20000
## 1182    22000
## 1183    33000
## 1184    18500
## 1185        0
## 1186        0
## 1187     5000
## 1188    22000
## 1189    13900
## 1190     8000
## 1191    11600
## 1192        0
## 1193    15000
## 1194    59387
## 1195     5200
## 1196    20000
## 1197        0
## 1198    11000
## 1199    11000
## 1200       -1
## 1201    18000
## 1202    11950
## 1203    13800
## 1204     2400
## 1205      200
## 1206     9600
## 1207    26000
## 1208     6876
## 1209     1100
## 1210    21000
## 1211    10500
## 1212     3500
## 1213    14000
## 1214     3528
## 1215    15000
## 1216    31000
## 1217    14000
## 1218    34500
## 1219    16000
## 1220     7500
## 1221    17000
## 1222    18500
## 1223    21000
## 1224    26400
## 1225        0
## 1226    35000
## 1227    25000
## 1228    13350
## 1229    17000
## 1230    12000
## 1231        0
## 1232        0
## 1233    18000
## 1234      500
## 1235     4500
## 1236     6000
## 1237        0
## 1238    18000
## 1239    21000
## 1240    32000
## 1241        0
## 1242     1000
## 1243    10000
## 1244    17000
## 1245     3600
## 1246    32000
## 1247    14200
## 1248     3000
## 1249    20000
## 1250    13500
## 1251    24900
## 1252        0
## 1253    59387
## 1254    10000
## 1255    18000
## 1256     7000
## 1257    13000
## 1258     5400
## 1259     7000
## 1260    11101
## 1261     8000
## 1262    16000
## 1263    15500
## 1264    25000
## 1265    19000
## 1266    21000
## 1267    10000
## 1268     2869
## 1269     6800
## 1270    12000
## 1271    13000
## 1272     1600
## 1273    27500
## 1274    14000
## 1275    14000
## 1276    12000
## 1277        0
## 1278    28600
## 1279     8000
## 1280    17000
## 1281     9000
## 1282    13000
## 1283    24700
## 1284    25000
## 1285    26000
## 1286     2000
## 1287    59387
## 1288    12000
## 1289    18000
## 1290    30000
## 1291    11000
## 1292       -2
## 1293       -1
## 1294      900
## 1295    23820
## 1296        0
## 1297        0
## 1298    21000
## 1299     3000
## 1300    23000
## 1301    16700
## 1302    16000
## 1303    12000
## 1304    23000
## 1305    11800
## 1306    16000
## 1307    14000
## 1308     7890
## 1309    16000
## 1310     3000
## 1311    25000
## 1312    16000
## 1313     8000
## 1314      200
## 1315    23000
## 1316    34000
## 1317    59387
## 1318     3200
## 1319    15008
## 1320    30000
## 1321    25500
## 1322      800
## 1323    59387
## 1324    17000
## 1325     8000
## 1326      300
## 1327    14500
## 1328        0
## 1329    21000
## 1330        0
## 1331     8000
## 1332    15000
## 1333     1500
## 1334     1200
## 1335    18000
## 1336    15000
## 1337    25000
## 1338    23000
## 1339    20000
## 1340    16000
## 1341    26000
## 1342        0
## 1343    32000
## 1344    12136
## 1345       -2
## 1346       -2
## 1347       -2
## 1348    15500
## 1349     6000
## 1350    12000
## 1351    31000
## 1352    15000
## 1353     9000
## 1354     5000
## 1355     5000
## 1356    28000
## 1357    22000
## 1358     7200
## 1359    16000
## 1360    16200
## 1361    27500
## 1362    12000
## 1363     4600
## 1364    26500
## 1365    32000
## 1366     8000
## 1367    14500
## 1368    22500
## 1369    59387
## 1370    15000
## 1371    16000
## 1372        0
## 1373    18780
## 1374    25000
## 1375    27000
## 1376    15000
## 1377    10400
## 1378    14000
## 1379    12000
## 1380    14000
## 1381    33500
## 1382    16000
## 1383    12500
## 1384    21500
## 1385    13500
## 1386    20000
## 1387     6000
## 1388    59387
## 1389    12000
## 1390    24600
## 1391    11000
## 1392        0
## 1393        0
## 1394    10600
## 1395    10940
## 1396    15900
## 1397     1400
## 1398     1500
## 1399    24090
## 1400     5283
## 1401    14000
## 1402     7500
## 1403    12000
## 1404    14500
## 1405     5000
## 1406    11000
## 1407    24600
## 1408    15400
## 1409    24000
## 1410    23000
## 1411     8000
## 1412    23000
## 1413    18000
## 1414    28000
## 1415     2000
## 1416    10000
## 1417    36000
## 1418    24000
## 1419    26000
## 1420      300
## 1421    31000
## 1422    22000
## 1423     1500
## 1424    26286
## 1425    29087
## 1426     9500
## 1427     2500
## 1428    35000
## 1429        0
## 1430     2000
## 1431     5200
## 1432     5000
## 1433     7220
## 1434    14000
## 1435    10100
## 1436    13000
## 1437    16000
## 1438     2200
## 1439    22000
## 1440     1200
## 1441    59387
## 1442     6000
## 1443     8000
## 1444     1000
## 1445     3000
## 1446    20000
## 1447     3200
## 1448       -2
## 1449    10000
## 1450    20000
## 1451    11000
## 1452    27000
## 1453    18500
## 1454     4424
## 1455    26000
## 1456    17400
## 1457    10000
## 1458    11000
## 1459    15600
## 1460        0
## 1461        0
## 1462      900
## 1463    18500
## 1464    24000
## 1465    15000
## 1466     6000
## 1467     8000
## 1468    11000
## 1469     5500
## 1470     3200
## 1471    30000
## 1472    10000
## 1473    22000
## 1474    17000
## 1475    20000
## 1476    30000
## 1477    25000
## 1478     2500
## 1479    16500
## 1480     1280
## 1481     6000
## 1482    20000
## 1483    12500
## 1484     4000
## 1485    11000
## 1486    11000
## 1487     9636
## 1488    10000
## 1489    11500
## 1490     1000
## 1491    14000
## 1492    14500
## 1493    15000
## 1494     9000
## 1495     8000
## 1496     4000
## 1497    12000
## 1498    10000
## 1499    14000
## 1500    13000
## 1501    19000
## 1502    10000
## 1503    10189
## 1504    15500
## 1505    15500
## 1506    16500
## 1507     7500
## 1508    10400
## 1509    24000
## 1510    59387
## 1511     4000
## 1512     4000
## 1513     3800
## 1514        0
## 1515        0
## 1516     1050
## 1517    15000
## 1518     8800
## 1519    17500
## 1520    12000
## 1521    25053
## 1522     1100
## 1523    16000
## 1524    36000
## 1525    14200
## 1526     7000
## 1527    30000
## 1528     8000
## 1529     2500
## 1530    14000
## 1531        0
## 1532     8000
## 1533     2360
## 1534    10500
## 1535    22000
## 1536        0
## 1537    24000
## 1538    23000
## 1539      835
## 1540    21000
## 1541    16000
## 1542     4000
## 1543     7000
## 1544    20500
## 1545    26000
## 1546    22000
## 1547      500
## 1548    59387
## 1549    15000
## 1550     2300
## 1551    14500
## 1552    12500
## 1553    20100
## 1554     7100
## 1555    17000
## 1556    16000
## 1557        0
## 1558    15300
## 1559    15000
## 1560    24000
## 1561      900
## 1562        0
## 1563    16000
## 1564    10500
## 1565    12000
## 1566    29010
## 1567    25000
## 1568    30000
## 1569    14000
## 1570        0
## 1571    35000
## 1572        0
## 1573    59387
## 1574    12000
## 1575        0
## 1576     5000
## 1577    20000
## 1578    15000
## 1579    13500
## 1580     8700
## 1581    10000
## 1582      325
## 1583      400
## 1584    23000
## 1585    22000
## 1586     6000
## 1587        0
## 1588    11000
## 1589     4000
## 1590    12000
## 1591    20000
## 1592    11000
## 1593    18000
## 1594    11000
## 1595    12000
## 1596      952
## 1597     9000
## 1598    25000
## 1599    12090
## 1600    17000
## 1601    12000
## 1602    19000
## 1603     1050
## 1604    21000
## 1605     6000
## 1606        0
## 1607     7000
## 1608    13000
## 1609        0
## 1610    24000
## 1611      300
## 1612    20000
## 1613    12480
## 1614    10000
## 1615    11000
## 1616    13000
## 1617    12500
## 1618    11000
## 1619     5000
## 1620        0
## 1621    59387
## 1622      600
## 1623    14700
## 1624    19000
## 1625        0
## 1626    13000
## 1627    16735
## 1628    14500
## 1629    12000
## 1630        0
## 1631        0
## 1632     8300
## 1633    10000
## 1634    20000
## 1635    18000
## 1636    20000
## 1637    10000
## 1638    12000
## 1639      500
## 1640    13000
## 1641     9360
## 1642    19200
## 1643    14000
## 1644     9900
## 1645    33000
## 1646     8000
## 1647    14000
## 1648    30000
## 1649    14000
## 1650     6000
## 1651     1000
## 1652    13704
## 1653     8280
## 1654     4500
## 1655    18700
## 1656     4500
## 1657    13000
## 1658     2000
## 1659    11000
## 1660    10000
## 1661     5000
## 1662    13312
## 1663    26400
## 1664     5000
## 1665    20000
## 1666     1400
## 1667    20000
## 1668     2000
## 1669    25000
## 1670    20000
## 1671    17000
## 1672    22000
## 1673     6200
## 1674    24750
## 1675    18000
## 1676    12000
## 1677     2500
## 1678    12000
## 1679        0
## 1680    22000
## 1681       -2
## 1682    14000
## 1683        0
## 1684    19000
## 1685    12000
## 1686     2000
## 1687    11300
## 1688        0
## 1689     4800
## 1690     1500
## 1691    27500
## 1692    15642
## 1693     4000
## 1694     5000
## 1695    17000
## 1696    21900
## 1697    10500
## 1698    10000
## 1699     6000
## 1700        0
## 1701    29000
## 1702     3000
## 1703     9000
## 1704      150
## 1705        0
## 1706    16000
## 1707    18000
## 1708    12800
## 1709        0
## 1710    17000
## 1711     9000
## 1712    19000
## 1713    59387
## 1714    30000
## 1715        0
## 1716     5000
## 1717    15500
## 1718    24000
## 1719    31000
## 1720    11000
## 1721      900
## 1722        0
## 1723     5780
## 1724    14000
## 1725      626
## 1726    33000
## 1727    24500
## 1728    27000
## 1729     3500
## 1730     2000
## 1731        0
## 1732    19584
## 1733    25000
## 1734    22000
## 1735     8000
## 1736     4600
## 1737     7363
## 1738    16000
## 1739    13000
## 1740    15000
## 1741    10500
## 1742    13000
## 1743    27000
## 1744     6470
## 1745    33000
## 1746        0
## 1747     6500
## 1748     2400
## 1749     6000
## 1750     7000
## 1751    35000
## 1752      500
## 1753    22500
## 1754        0
## 1755    14000
## 1756    13000
## 1757    15600
## 1758        0
## 1759    24000
## 1760    13600
## 1761    39000
## 1762        0
## 1763    35000
## 1764    15000
## 1765    32460
## 1766    18000
## 1767    20000
## 1768     3000
## 1769    32000
## 1770    31000
## 1771    10000
## 1772        0
## 1773    16000
## 1774    15000
## 1775        0
## 1776     3000
## 1777        0
## 1778       -2
## 1779     2500
## 1780        0
## 1781     3500
## 1782     2200
## 1783    17000
## 1784    11600
## 1785    26000
## 1786    24000
## 1787     2300
## 1788     5147
## 1789     6000
## 1790    20000
## 1791        0
## 1792    11200
## 1793    17000
## 1794     1900
## 1795    17000
## 1796    59387
## 1797    19000
## 1798    32000
## 1799    21000
## 1800    18000
## 1801     7000
## 1802    17000
## 1803    15000
## 1804     3500
## 1805    16000
## 1806     6000
## 1807     3000
## 1808     4000
## 1809    59387
## 1810    26000
## 1811    25600
## 1812        0
## 1813    25000
## 1814        0
## 1815        0
## 1816        0
## 1817        0
## 1818        0
## 1819    18000
## 1820     9000
## 1821    14000
## 1822    23000
## 1823    30000
## 1824    27000
## 1825    18600
## 1826    18000
## 1827     7800
## 1828     8000
## 1829    14000
## 1830    25000
## 1831        0
## 1832     2500
## 1833    59387
## 1834    35000
## 1835     3000
## 1836    32000
## 1837    38500
## 1838    32000
## 1839    10000
## 1840    25000
## 1841     7000
## 1842    15000
## 1843        0
## 1844    15500
## 1845     3500
## 1846    30000
## 1847    18000
## 1848     2500
## 1849        0
## 1850     6500
## 1851        0
## 1852    13000
## 1853    35000
## 1854        0
## 1855      120
## 1856        0
## 1857    11800
## 1858     9000
## 1859    23000
## 1860     9800
## 1861    20000
## 1862    21000
## 1863     5000
## 1864     6000
## 1865    24000
## 1866        0
## 1867    22000
## 1868    10000
## 1869    10000
## 1870        0
## 1871    12500
## 1872     5000
## 1873     1400
## 1874    10000
## 1875     1300
## 1876    11000
## 1877    20000
## 1878     7000
## 1879    22700
## 1880     2000
## 1881     6500
## 1882     1500
## 1883    39000
## 1884    10000
## 1885     8000
## 1886        0
## 1887    30000
## 1888    14000
## 1889    12000
## 1890       -1
## 1891    10000
## 1892     2000
## 1893    15000
## 1894    18000
## 1895        0
## 1896    15000
## 1897        0
## 1898        0
## 1899        0
## 1900    12200
## 1901        0
## 1902        0
## 1903     2500
## 1904        0
## 1905        0
## 1906     8000
## 1907     4200
## 1908    15200
## 1909    59387
## 1910    12000
## 1911        0
## 1912     2000
## 1913     9000
## 1914     6000
## 1915    18000
## 1916      600
## 1917    23000
## 1918     1169
## 1919    27000
## 1920    13600
## 1921    21000
## 1922     7500
## 1923    18000
## 1924    22000
## 1925    15000
## 1926    11000
## 1927    59387
## 1928    17000
## 1929    29700
## 1930     3500
## 1931    13000
## 1932    19000
## 1933        0
## 1934        0
## 1935     8000
## 1936     3000
## 1937    13000
## 1938    30000
## 1939     1300
## 1940     1300
## 1941    15600
## 1942    15500
## 1943    10000
## 1944     8000
## 1945     9771
## 1946     8160
## 1947        0
## 1948     7000
## 1949     1500
## 1950     6000
## 1951        0
## 1952    11500
## 1953    27000
## 1954    10000
## 1955     5000
## 1956    13000
## 1957       50
## 1958    22500
## 1959    12000
## 1960    18000
## 1961     1700
## 1962    11200
## 1963    20000
## 1964    28000
## 1965        0
## 1966    24000
## 1967     8000
## 1968    10000
## 1969    13000
## 1970    13000
## 1971    59387
## 1972    20100
## 1973    20000
## 1974       88
## 1975        0
## 1976    16500
## 1977    20000
## 1978    26000
## 1979        0
## 1980    59387
## 1981     7200
## 1982        0
## 1983    59387
## 1984    16000
## 1985    19200
## 1986    12500
## 1987    31000
## 1988    35000
## 1989    11000
## 1990     5495
## 1991    11000
## 1992     6000
## 1993     8400
## 1994    15000
## 1995    14500
## 1996    10900
## 1997    15200
## 1998       -2
## 1999     4000
## 2000     3000
## 2001        0
## 2002     4000
## 2003    23000
## 2004     2400
## 2005    12000
## 2006    26000
## 2007    18000
## 2008        0
## 2009    13400
## 2010     3020
## 2011        0
## 2012     1000
## 2013       -2
## 2014     4000
## 2015    22000
## 2016    20000
## 2017        0
## 2018    30000
## 2019     7000
## 2020        0
## 2021    19000
## 2022        0
## 2023    23000
## 2024    36000
## 2025     7100
## 2026     7000
## 2027     3500
## 2028    15000
## 2029     9000
## 2030     9000
## 2031     1200
## 2032        0
## 2033    18000
## 2034     5300
## 2035      200
## 2036     7200
## 2037    19250
## 2038        0
## 2039    12370
## 2040     3500
## 2041    18000
## 2042    16500
## 2043        0
## 2044        0
## 2045    24000
## 2046     2000
## 2047     5900
## 2048        0
## 2049    19000
## 2050    26900
## 2051    27000
## 2052    26000
## 2053    15443
## 2054    10000
## 2055    11000
## 2056     4700
## 2057     3500
## 2058    13000
## 2059    12000
## 2060     2500
## 2061    11000
## 2062    20000
## 2063     4000
## 2064     1000
## 2065    10000
## 2066    18000
## 2067    26000
## 2068    10000
## 2069    10000
## 2070     9000
## 2071    10800
## 2072    11000
## 2073    20000
## 2074    30000
## 2075    20000
## 2076    17000
## 2077    19200
## 2078    18000
## 2079        0
## 2080    30000
## 2081     3000
## 2082        0
## 2083    13000
## 2084    11000
## 2085    17500
## 2086    17000
## 2087    19000
## 2088     2000
## 2089    13000
## 2090    16000
## 2091      400
## 2092    27600
## 2093        0
## 2094        0
## 2095    18500
## 2096     5000
## 2097     8000
## 2098        0
## 2099    25000
## 2100      500
## 2101    21000
## 2102     8000
## 2103        0
## 2104    24000
## 2105     4000
## 2106     9000
## 2107    22000
## 2108    24000
## 2109        0
## 2110    12000
## 2111    18000
## 2112    26000
## 2113     8000
## 2114     4800
## 2115     3500
## 2116    25000
## 2117        0
## 2118     3020
## 2119    20000
## 2120    14000
## 2121    13000
## 2122        0
## 2123    21000
## 2124      300
## 2125        0
## 2126    15000
## 2127     6000
## 2128    11000
## 2129    15700
## 2130    19000
## 2131    20000
## 2132        0
## 2133    12593
## 2134     6500
## 2135     5000
## 2136     3000
## 2137    21000
## 2138    15000
## 2139    59387
## 2140    20000
## 2141    19900
## 2142    12000
## 2143        0
## 2144     5500
## 2145    30000
## 2146    18000
## 2147    28000
## 2148    24500
## 2149    23000
## 2150      700
## 2151    16000
## 2152     8000
## 2153    14000
## 2154     9300
## 2155    20400
## 2156     1200
## 2157        0
## 2158     8500
## 2159     5000
## 2160     2000
## 2161     9000
## 2162     7919
## 2163    16000
## 2164     9000
## 2165    14000
## 2166     3041
## 2167        0
## 2168    10000
## 2169    24000
## 2170      900
## 2171        0
## 2172     6000
## 2173        0
## 2174     6667
## 2175        0
## 2176    20000
## 2177     4434
## 2178     9900
## 2179      520
## 2180       -2
## 2181     8000
## 2182    17000
## 2183    13000
## 2184    14000
## 2185     1200
## 2186    10500
## 2187        0
## 2188    25170
## 2189        0
## 2190    10000
## 2191       -2
## 2192     8000
## 2193    12000
## 2194    15000
## 2195    10000
## 2196    59387
## 2197        0
## 2198     8000
## 2199        0
## 2200    12000
## 2201    14200
## 2202    28000
## 2203    18000
## 2204    17000
## 2205        0
## 2206     8000
## 2207    14000
## 2208    11000
## 2209     8000
## 2210    10000
## 2211    10800
## 2212    17500
## 2213    21000
## 2214     8100
## 2215    17900
## 2216    14000
## 2217    11000
## 2218     7000
## 2219    16800
## 2220    14500
## 2221     1300
## 2222    20000
## 2223    18000
## 2224    14000
## 2225     8000
## 2226    12000
## 2227     9000
## 2228     7000
## 2229     7500
## 2230     7000
## 2231     2500
## 2232    10000
## 2233    20000
## 2234    15500
## 2235     2200
## 2236    38000
## 2237    12000
## 2238    16500
## 2239    10000
## 2240    30000
## 2241        0
## 2242     5300
## 2243    30000
## 2244        0
## 2245    19000
## 2246    13600
## 2247    24000
## 2248        0
## 2249    25000
## 2250    23000
## 2251     8000
## 2252    18000
## 2253     9600
## 2254        0
## 2255    19000
## 2256    25000
## 2257    59387
## 2258    18000
## 2259     9400
## 2260    19000
## 2261    13000
## 2262    17100
## 2263     6000
## 2264    13500
## 2265     8400
## 2266        0
## 2267    22000
## 2268    19000
## 2269        0
## 2270     7000
## 2271     5800
## 2272    18000
## 2273        0
## 2274    13000
## 2275        0
## 2276     7200
## 2277    38300
## 2278    15000
## 2279    37000
## 2280    18000
## 2281    22000
## 2282        0
## 2283     1340
## 2284     5400
## 2285    29000
## 2286    10000
## 2287     9000
## 2288        0
## 2289        0
## 2290    18000
## 2291        0
## 2292    10000
## 2293    17000
## 2294    18000
## 2295     9500
## 2296        0
## 2297        0
## 2298     6538
## 2299    14400
## 2300        0
## 2301     9200
## 2302     8000
## 2303    19000
## 2304        0
## 2305     5000
## 2306    15000
## 2307    11000
## 2308     2000
## 2309    22800
## 2310     4000
## 2311    11000
## 2312        0
## 2313    16000
## 2314    12000
## 2315     7000
## 2316    18000
## 2317    12000
## 2318    15200
## 2319    22500
## 2320        0
## 2321    12000
## 2322     7000
## 2323       -2
## 2324     8000
## 2325    27800
## 2326    10500
## 2327        0
## 2328    13872
## 2329     3500
## 2330    59387
## 2331    15000
## 2332    14000
## 2333    16000
## 2334    14000
## 2335     4000
## 2336     5700
## 2337     3000
## 2338    17000
## 2339        0
## 2340     4000
## 2341     4000
## 2342    31000
## 2343    12000
## 2344        0
## 2345     3000
## 2346        0
## 2347        0
## 2348    10000
## 2349    30000
## 2350    13000
## 2351    24000
## 2352     7000
## 2353    29565
## 2354    25800
## 2355    25000
## 2356    10000
## 2357     6000
## 2358     8000
## 2359     1000
## 2360    32000
## 2361    23000
## 2362     2000
## 2363        0
## 2364     4320
## 2365     7000
## 2366    13000
## 2367        0
## 2368     6000
## 2369    14000
## 2370    20000
## 2371    30000
## 2372     3000
## 2373    23000
## 2374    14000
## 2375    16400
## 2376     5000
## 2377    23000
## 2378      175
## 2379    37251
## 2380     7800
## 2381    36700
## 2382    19000
## 2383    31000
## 2384    21000
## 2385    17000
## 2386    14000
## 2387        0
## 2388     6000
## 2389    15000
## 2390        0
## 2391    15000
## 2392    15000
## 2393        0
## 2394     3500
## 2395    11000
## 2396    23000
## 2397    26959
## 2398      957
## 2399     9000
## 2400        0
## 2401    14871
## 2402     5000
## 2403    23000
## 2404    12000
## 2405    12500
## 2406        0
## 2407    13500
## 2408    14800
## 2409    21000
## 2410     6000
## 2411    14000
## 2412    14000
## 2413    16000
## 2414     2000
## 2415    24000
## 2416    20000
## 2417     6000
## 2418     5000
## 2419    12000
## 2420    20000
## 2421    21000
## 2422     9000
## 2423    10000
## 2424     3000
## 2425    10000
## 2426    17000
## 2427     8000
## 2428     1000
## 2429     9000
## 2430    14000
## 2431     5000
data.esteem.pca <- data.esteem %>%
  mutate(log_income = log(Income87 + 2),       # +2 shifts the -2 missing code to 0; log(0) = -Inf is filtered out before modeling
         bmi = Weight05/(HeightFeet05^2)) %>%  # height is recorded in feet
  select(Gender, Education05, Job05, MotherEd, FatherEd, FamilyIncome78,
         bmi, log_income, starts_with("Esteem87"))

# PC1 of the ten 1987 esteem items, using the PC1 loadings obtained earlier
data.esteem.pca <- data.esteem.pca %>%
  mutate(pc1_esteem = 0.324*Esteem87_1 + 0.333*Esteem87_2 + 0.322*Esteem87_3 +
           0.324*Esteem87_4 + 0.315*Esteem87_5 + 0.347*Esteem87_6 + 0.315*Esteem87_7 +
           0.280*Esteem87_8 + 0.277*Esteem87_9 + 0.318*Esteem87_10)
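The PC1 loadings above are hard-coded; as a sanity check, they can be recomputed directly with prcomp(). This is a sketch only — the sign of a principal component is arbitrary, so the recomputed loadings may be flipped relative to the hand-typed version.

```r
# Recompute PC1 of the ten 1987 esteem items from the data itself.
# Assumes the Esteem87_* columns of data.esteem.pca are numeric.
esteem87 <- data.esteem.pca %>%
  select(starts_with("Esteem87")) %>%
  na.omit()
pca87 <- prcomp(esteem87, scale. = TRUE)
round(pca87$rotation[, 1], 3)   # should match the hand-typed loadings (up to sign)
```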
b)   Run a few regression models between PC1 of all the esteem scores and suitable variables listed in a). Find a final, best model using your own criterion. 

  - How did you arrive at this model? Run a model diagnosis to see if the linear model assumptions are reasonably met. 
    
  - Write a summary of your findings. In particular, explain what and how the variables in the model affect one's self-esteem. 
    
rgr.data <- na.omit(data.esteem.pca)
rgr.data <- rgr.data %>% 
  filter(!is.infinite(log_income))   # drop rows where Income87 + 2 = 0 gave log(0) = -Inf

model1 <- lm(pc1_esteem ~ Gender + Education05 + Job05 + MotherEd + FatherEd +
               FamilyIncome78 + bmi + log_income, data = rgr.data)
summary(model1)
## 
## Call:
## lm(formula = pc1_esteem ~ Gender + Education05 + Job05 + MotherEd + 
##     FatherEd + FamilyIncome78 + bmi + log_income, data = rgr.data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.176 -0.944 -0.005  0.993  2.865 
## 
## Coefficients:
##                                                                                           Estimate
## (Intercept)                                                                               8.74e+00
## Gendermale                                                                                1.65e-01
## Education05                                                                               8.03e-02
## Job0510 TO 430: Executive, Administrative and Managerial Occupations                      4.53e-01
## Job051000 TO 1240: Mathematical and Computer Scientists                                   5.05e-01
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians  2.32e-02
## Job051600 TO 1760: Physical Scientists                                                   -7.90e-01
## Job051800 TO 1860: Social Scientists and Related Workers                                 -3.07e-01
## Job051900 TO 1960: Life, Physical and Social Science Technicians                          2.75e-01
## Job052000 TO 2060: Counselors, Sociala and Religious Workers                              1.56e-01
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers                              2.49e-01
## Job052200 TO 2340: Teachers                                                               1.92e-01
## Job052400 TO 2550: Education, Training and Library Workers                                1.96e-01
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers                7.62e-01
## Job052800 TO 2960: Media and Communications Workers                                       3.67e-01
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners                           4.63e-01
## Job053300 TO 3650: Health Care Technical and Support Occupations                         -1.13e-01
## Job053700 TO 3950: Protective Service Occupations                                         5.67e-01
## Job054000 TO 4160: Food Preparation and Serving Related Occupations                      -1.43e-01
## Job054200 TO 4250: Cleaning and Building Service Occupations                             -2.59e-01
## Job054300 TO 4430: Entertainment Attendants and Related Workers                          -6.97e-01
## Job054500 TO 4650: Personal Care and Service Workers                                      2.71e-01
## Job054700 TO 4960: Sales and Related Workers                                              2.84e-01
## Job05500 TO 950: Management Related Occupations                                           5.27e-01
## Job055000 TO 5930: Office and Administrative Support Workers                              2.97e-01
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations                             -1.75e-01
## Job056200 TO 6940: Construction Trade and Extraction Workers                              2.88e-02
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers                          1.42e-01
## Job057700 TO 7750: Production and Operating Workers                                       1.87e-01
## Job057800 TO 7850: Food Preparation Occupations                                           1.89e-01
## Job057900 TO 8960: Setters, Operators and Tenders                                         1.39e-01
## Job059000 TO 9750: Transportation and Material Moving Workers                            -1.13e-01
## Job059990: Uncodeable                                                                    -5.12e-02
## MotherEd                                                                                  2.81e-02
## FatherEd                                                                                  1.03e-02
## FamilyIncome78                                                                            2.94e-06
## bmi                                                                                      -1.42e-02
## log_income                                                                                2.37e-02
##                                                                                          Std. Error
## (Intercept)                                                                                2.66e-01
## Gendermale                                                                                 5.91e-02
## Education05                                                                                1.28e-02
## Job0510 TO 430: Executive, Administrative and Managerial Occupations                       1.73e-01
## Job051000 TO 1240: Mathematical and Computer Scientists                                    2.22e-01
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians   2.33e-01
## Job051600 TO 1760: Physical Scientists                                                     6.24e-01
## Job051800 TO 1860: Social Scientists and Related Workers                                   5.18e-01
## Job051900 TO 1960: Life, Physical and Social Science Technicians                           4.82e-01
## Job052000 TO 2060: Counselors, Sociala and Religious Workers                               2.51e-01
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers                               3.52e-01
## Job052200 TO 2340: Teachers                                                                2.00e-01
## Job052400 TO 2550: Education, Training and Library Workers                                 2.76e-01
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers                 2.95e-01
## Job052800 TO 2960: Media and Communications Workers                                        3.71e-01
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners                            2.16e-01
## Job053300 TO 3650: Health Care Technical and Support Occupations                           2.02e-01
## Job053700 TO 3950: Protective Service Occupations                                          2.30e-01
## Job054000 TO 4160: Food Preparation and Serving Related Occupations                        2.18e-01
## Job054200 TO 4250: Cleaning and Building Service Occupations                               2.20e-01
## Job054300 TO 4430: Entertainment Attendants and Related Workers                            4.14e-01
## Job054500 TO 4650: Personal Care and Service Workers                                       2.46e-01
## Job054700 TO 4960: Sales and Related Workers                                               1.82e-01
## Job05500 TO 950: Management Related Occupations                                            1.99e-01
## Job055000 TO 5930: Office and Administrative Support Workers                               1.73e-01
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations                               4.34e-01
## Job056200 TO 6940: Construction Trade and Extraction Workers                               1.95e-01
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers                           2.01e-01
## Job057700 TO 7750: Production and Operating Workers                                        2.38e-01
## Job057800 TO 7850: Food Preparation Occupations                                            6.23e-01
## Job057900 TO 8960: Setters, Operators and Tenders                                          1.99e-01
## Job059000 TO 9750: Transportation and Material Moving Workers                              1.98e-01
## Job059990: Uncodeable                                                                      1.21e+00
## MotherEd                                                                                   1.25e-02
## FatherEd                                                                                   9.22e-03
## FamilyIncome78                                                                             1.94e-06
## bmi                                                                                        1.40e-02
## log_income                                                                                 8.59e-03
##                                                                                          t value
## (Intercept)                                                                                32.87
## Gendermale                                                                                  2.79
## Education05                                                                                 6.25
## Job0510 TO 430: Executive, Administrative and Managerial Occupations                        2.62
## Job051000 TO 1240: Mathematical and Computer Scientists                                     2.27
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians    0.10
## Job051600 TO 1760: Physical Scientists                                                     -1.27
## Job051800 TO 1860: Social Scientists and Related Workers                                   -0.59
## Job051900 TO 1960: Life, Physical and Social Science Technicians                            0.57
## Job052000 TO 2060: Counselors, Sociala and Religious Workers                                0.62
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers                                0.71
## Job052200 TO 2340: Teachers                                                                 0.96
## Job052400 TO 2550: Education, Training and Library Workers                                  0.71
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers                  2.59
## Job052800 TO 2960: Media and Communications Workers                                         0.99
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners                             2.14
## Job053300 TO 3650: Health Care Technical and Support Occupations                           -0.56
## Job053700 TO 3950: Protective Service Occupations                                           2.46
## Job054000 TO 4160: Food Preparation and Serving Related Occupations                        -0.65
## Job054200 TO 4250: Cleaning and Building Service Occupations                               -1.18
## Job054300 TO 4430: Entertainment Attendants and Related Workers                            -1.68
## Job054500 TO 4650: Personal Care and Service Workers                                        1.10
## Job054700 TO 4960: Sales and Related Workers                                                1.56
## Job05500 TO 950: Management Related Occupations                                             2.64
## Job055000 TO 5930: Office and Administrative Support Workers                                1.71
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations                               -0.40
## Job056200 TO 6940: Construction Trade and Extraction Workers                                0.15
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers                            0.70
## Job057700 TO 7750: Production and Operating Workers                                         0.79
## Job057800 TO 7850: Food Preparation Occupations                                             0.30
## Job057900 TO 8960: Setters, Operators and Tenders                                           0.70
## Job059000 TO 9750: Transportation and Material Moving Workers                              -0.57
## Job059990: Uncodeable                                                                      -0.04
## MotherEd                                                                                    2.25
## FatherEd                                                                                    1.11
## FamilyIncome78                                                                              1.52
## bmi                                                                                        -1.01
## log_income                                                                                  2.76
##                                                                                          Pr(>|t|)
## (Intercept)                                                                               < 2e-16
## Gendermale                                                                                 0.0053
## Education05                                                                               4.9e-10
## Job0510 TO 430: Executive, Administrative and Managerial Occupations                       0.0090
## Job051000 TO 1240: Mathematical and Computer Scientists                                    0.0232
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians   0.9209
## Job051600 TO 1760: Physical Scientists                                                     0.2059
## Job051800 TO 1860: Social Scientists and Related Workers                                   0.5529
## Job051900 TO 1960: Life, Physical and Social Science Technicians                           0.5691
## Job052000 TO 2060: Counselors, Sociala and Religious Workers                               0.5338
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers                               0.4792
## Job052200 TO 2340: Teachers                                                                0.3392
## Job052400 TO 2550: Education, Training and Library Workers                                 0.4768
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers                 0.0097
## Job052800 TO 2960: Media and Communications Workers                                        0.3228
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners                            0.0323
## Job053300 TO 3650: Health Care Technical and Support Occupations                           0.5759
## Job053700 TO 3950: Protective Service Occupations                                          0.0140
## Job054000 TO 4160: Food Preparation and Serving Related Occupations                        0.5127
## Job054200 TO 4250: Cleaning and Building Service Occupations                               0.2379
## Job054300 TO 4430: Entertainment Attendants and Related Workers                            0.0926
## Job054500 TO 4650: Personal Care and Service Workers                                       0.2699
## Job054700 TO 4960: Sales and Related Workers                                               0.1190
## Job05500 TO 950: Management Related Occupations                                            0.0082
## Job055000 TO 5930: Office and Administrative Support Workers                               0.0866
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations                               0.6864
## Job056200 TO 6940: Construction Trade and Extraction Workers                               0.8830
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers                           0.4816
## Job057700 TO 7750: Production and Operating Workers                                        0.4325
## Job057800 TO 7850: Food Preparation Occupations                                            0.7612
## Job057900 TO 8960: Setters, Operators and Tenders                                          0.4839
## Job059000 TO 9750: Transportation and Material Moving Workers                              0.5667
## Job059990: Uncodeable                                                                      0.9664
## MotherEd                                                                                   0.0242
## FatherEd                                                                                   0.2660
## FamilyIncome78                                                                             0.1290
## bmi                                                                                        0.3106
## log_income                                                                                 0.0058
##                                                                                             
## (Intercept)                                                                              ***
## Gendermale                                                                               ** 
## Education05                                                                              ***
## Job0510 TO 430: Executive, Administrative and Managerial Occupations                     ** 
## Job051000 TO 1240: Mathematical and Computer Scientists                                  *  
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians    
## Job051600 TO 1760: Physical Scientists                                                      
## Job051800 TO 1860: Social Scientists and Related Workers                                    
## Job051900 TO 1960: Life, Physical and Social Science Technicians                            
## Job052000 TO 2060: Counselors, Sociala and Religious Workers                                
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers                                
## Job052200 TO 2340: Teachers                                                                 
## Job052400 TO 2550: Education, Training and Library Workers                                  
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers               ** 
## Job052800 TO 2960: Media and Communications Workers                                         
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners                          *  
## Job053300 TO 3650: Health Care Technical and Support Occupations                            
## Job053700 TO 3950: Protective Service Occupations                                        *  
## Job054000 TO 4160: Food Preparation and Serving Related Occupations                         
## Job054200 TO 4250: Cleaning and Building Service Occupations                                
## Job054300 TO 4430: Entertainment Attendants and Related Workers                          .  
## Job054500 TO 4650: Personal Care and Service Workers                                        
## Job054700 TO 4960: Sales and Related Workers                                                
## Job05500 TO 950: Management Related Occupations                                          ** 
## Job055000 TO 5930: Office and Administrative Support Workers                             .  
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations                                
## Job056200 TO 6940: Construction Trade and Extraction Workers                                
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers                            
## Job057700 TO 7750: Production and Operating Workers                                         
## Job057800 TO 7850: Food Preparation Occupations                                             
## Job057900 TO 8960: Setters, Operators and Tenders                                           
## Job059000 TO 9750: Transportation and Material Moving Workers                               
## Job059990: Uncodeable                                                                       
## MotherEd                                                                                 *  
## FatherEd                                                                                    
## FamilyIncome78                                                                              
## bmi                                                                                         
## log_income                                                                               ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 2375 degrees of freedom
## Multiple R-squared:  0.118,  Adjusted R-squared:  0.104 
## F-statistic: 8.58 on 37 and 2375 DF,  p-value: <2e-16
model2 <- lm(pc1_esteem ~ Gender + Education05 + MotherEd + log_income, data = rgr.data)
summary(model2)
## 
## Call:
## lm(formula = pc1_esteem ~ Gender + Education05 + MotherEd + log_income, 
##     data = rgr.data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.530 -0.991 -0.005  1.013  2.605 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.40594    0.16194   51.91  < 2e-16 ***
## Gendermale   0.12546    0.05011    2.50  0.01235 *  
## Education05  0.10840    0.01094    9.91  < 2e-16 ***
## MotherEd     0.04592    0.01049    4.38  1.3e-05 ***
## log_income   0.03169    0.00856    3.70  0.00022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.22 on 2408 degrees of freedom
## Multiple R-squared:  0.0842, Adjusted R-squared:  0.0827 
## F-statistic: 55.4 on 4 and 2408 DF,  p-value: <2e-16
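The move from model1 to model2 drops Job05, FatherEd, FamilyIncome78 and bmi; a nested-model F-test and the standard residual plots (a sketch using base R's anova() and plot.lm()) can be used to justify the choice and check the linear model assumptions:

```r
# F-test for the dropped terms: a large p-value means the smaller model
# loses no significant explanatory power relative to the full model.
anova(model2, model1)

# Residuals vs. fitted checks linearity and constant variance;
# the normal Q-Q plot checks approximate normality of the residuals.
par(mfrow = c(1, 2))
plot(model2, which = 1)
plot(model2, which = 2)
par(mfrow = c(1, 1))
```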

2 Case study 2: Breast cancer sub-type

The Cancer Genome Atlas (TCGA), a landmark cancer genomics program by the National Cancer Institute (NCI), molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. The genomic data are open to the public through the Genomic Data Commons (GDC) Data Portal.

In this study, we focus on four sub-types of breast cancer (BRCA): basal-like (Basal), luminal A-like (LumA), luminal B-like (LumB), and HER2-enriched (Her2). The sub-types are based on PAM50, a clinical-grade luminal-basal classifier.

  • Luminal A cancers are low-grade, tend to grow slowly and have the best prognosis.
  • Luminal B cancers generally grow slightly faster than luminal A cancers and their prognosis is slightly worse.
  • HER2-enriched cancers tend to grow faster than luminal cancers and can have a worse prognosis, but they are often successfully treated with targeted therapies aimed at the HER2 protein.
  • Basal-like (triple-negative) breast cancers lack the three receptors that the other sub-types have, so they have fewer treatment options.

We will try to use the mRNA expression data alone, without the labels, to recover the four sub-types. Classification without labels, or prediction without outcomes, is called unsupervised learning. We will use K-means and spectral clustering to cluster the mRNA data and see whether the sub-types can be separated from expression alone.
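As a preview of that step, K-means with four centers can be run on the standardized expression matrix — a minimal sketch, assuming brca and brca_subtype have been loaded as in the chunk below, and removing constant (zero-variance) genes first so that scale() does not produce NaNs:

```r
# Drop zero-variance genes, standardize, and cluster with K-means.
expr <- as.matrix(brca)
expr <- expr[, apply(expr, 2, sd) > 0]
set.seed(1)
km <- kmeans(scale(expr), centers = 4, nstart = 20)  # nstart > 1 avoids poor local optima
table(km$cluster, brca_subtype)   # cross-tabulate clusters against PAM50 labels
```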

We first read the data using data.table::fread(), which is a faster way to read in big data than read.csv().

  1. Summary and transformation

    a) How many patients are there in each sub-type?
brca <- fread("data/brca_subtype.csv")

# get the sub-type information
brca_subtype <- brca$BRCA_Subtype_PAM50
brca <- brca[,-1]

table(brca_subtype)
## brca_subtype
## Basal  Her2  LumA  LumB 
##   208    91   628   233
b) Randomly pick 5 genes and plot the histogram by each sub-type.
set.seed(10)
brca_sample_idx <- sample(ncol(brca), 5)

# helper: histograms of the 5 sampled genes for one sub-type
plot_subtype_hist <- function(subtype) {
  brca[which(brca_subtype == subtype), ] %>% 
    select(all_of(brca_sample_idx)) %>%      # select columns by index
    pivot_longer(cols = everything()) %>%    # long format for facet_wrap()
    ggplot(aes(x = value, y = ..density..)) +
    geom_histogram(aes(fill = name)) +
    facet_wrap(~name, scales = "free") +
    theme_bw() +
    theme(legend.position = "none") +
    ggtitle(subtype)
}

plot_subtype_hist("Basal")
plot_subtype_hist("Her2")
plot_subtype_hist("LumA")
plot_subtype_hist("LumB")

c) Remove genes with zero counts and with no variability, then apply a logarithmic transform.
require(caret)
dim(brca)
## [1]  1160 19947
# remove genes with 0 counts
sel_cols <- which(colSums(abs(brca)) != 0)
brca_sub <- brca[, sel_cols, with=F]
dim(brca_sub)
## [1]  1160 19669
# remove genes with no variability (SD=0)
# after removing 0 counts, there are no genes/cols with all same values
# brca_sub[,-nearZeroVar(brca_sub)]
dim(brca_sub)
## [1]  1160 19669
# log
brca_sub <- log2(as.matrix(brca_sub+1e-10))
  2. Apply kmeans on the transformed dataset with 4 centers and output the discrepancy table between the real sub-type brca_subtype and the cluster labels.
system.time({brca_sub_kmeans <- kmeans(x = brca_sub, 4)})  
##    user  system elapsed 
##  13.195   0.268  13.585
# save the results as RDS
saveRDS(brca_sub_kmeans, "data/brca_kmeans.RDS")

# read in tcga_sub_kmeans
brca_sub_kmeans <- readRDS("data/brca_kmeans.RDS")

# discrepancy table 
table(brca_subtype, brca_sub_kmeans$cluster)
##             
## brca_subtype   1   2   3   4
##        Basal   1  17   3 187
##        Her2   39   9  26  17
##        LumA  392  82 154   0
##        LumB   98  22 111   2
  3. Spectral clustering: to scale or not to scale?

    a) Apply PCA on the centered and scaled dataset. How many PCs should we use and why? You are encouraged to use irlba::irlba().
require("irlba")
# center and scale the data
brca_sub_scaled_centered <- scale(as.matrix(brca_sub), center = T, scale = T)
svd_ret <- irlba::irlba(brca_sub_scaled_centered, nv = 10)
names(svd_ret)
## [1] "d"     "u"     "v"     "iter"  "mprod"
# Approximate the PVE: each scaled column has variance 1,
# so the total variance equals the number of columns kept
svd_var <- svd_ret$d^2/(nrow(brca_sub_scaled_centered)-1)
pve_apx <- svd_var/ncol(brca_sub_scaled_centered)
plot(pve_apx, type="b", pch = 19, frame = FALSE)

We may look at the scree plot of the PVEs and apply the elbow rule: take the number of PCs at which there is a sharp drop in the scree plot. Here we could choose either 2 or 4 PCs; to capture more cumulative explained variance, we choose 4 PCs.
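To double-check this choice, the cumulative PVE can be inspected directly (a minimal sketch, assuming the `pve_apx` vector computed above):

```r
# cumulative proportion of variance explained by the first k PCs
cum_pve <- cumsum(pve_apx)
plot(cum_pve, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of PCs", ylab = "Cumulative PVE (approx.)")
```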

b) Plot PC1 vs PC2 of the centered and scaled data and PC1 vs PC2 of the centered but unscaled data side by side. Should we scale or not scale for clustering purposes? Why? (Hint: to put plots side by side, use `gridExtra::grid.arrange()` or `ggpubr::ggarrange()` or `egg::ggarrange()` for ggplots; use `fig.show="hold"` as the chunk option for base plots)
require("gridExtra")
require("grid")

# get pc score
pc_score <- brca_sub_scaled_centered %*% svd_ret$v[, 1:3]

# apply kmean
kmean_ret <- kmeans(x = pc_score, 4)

p1 <- data.table(x = pc_score[,1], 
                y = pc_score[,2],
                col = as.factor(brca_subtype),
                cl = as.factor(kmean_ret$cluster)) %>%
  ggplot() + 
  geom_point(aes(x = x, y = y, col = col, shape = cl)) +
  scale_color_manual(labels = c("Basal", "Her2", "LumA", "LumB"),
                     values = scales::hue_pal()(4)) +
  scale_shape_manual(labels = c("A", "B", "C", "D"),
                     values = c(1, 2, 3, 4)) + 
  theme_bw() +
  labs(color = "Cancer type", shape = "Cluster") +
  xlab("PC1") +
  ylab("PC2") +
  ggtitle("Centered and Scaled")

brca_sub_centered <- scale(as.matrix(brca_sub), center = T, scale = F)
svd_ret <- irlba::irlba(brca_sub_centered, nv = 10)

# Approximate the PVE: for unscaled data the total variance
# is the sum of the column variances
svd_var <- svd_ret$d^2/(nrow(brca_sub_centered)-1)
pve_apx <- svd_var/sum(apply(brca_sub_centered, 2, var)) # plot also shows 4 PCs by the elbow method

pc_score <- brca_sub_centered %*% svd_ret$v[, 1:3]

# apply kmean
kmean_ret <- kmeans(x = pc_score, 4)

p2 <- data.table(x = pc_score[,1], 
                y = pc_score[,2],
                col = as.factor(brca_subtype),
                cl = as.factor(kmean_ret$cluster)) %>%
  ggplot() + 
  geom_point(aes(x = x, y = y, col = col, shape = cl)) +
  scale_color_manual(labels = c("Basal", "Her2", "LumA", "LumB"),
                     values = scales::hue_pal()(4)) +
  scale_shape_manual(labels = c("A", "B", "C", "D"),
                     values = c(1, 2, 3, 4)) + 
  theme_bw() +
  labs(color = "Cancer type", shape = "Cluster") +
  xlab("PC1") +
  ylab("PC2") + 
  ggtitle("Centered")

grid.arrange(p1, p2, nrow = 1)

In our case, scaling does not seem to be necessary, since both plots maintain similar cluster shapes. In general, however, clustering is scale-sensitive when features are measured in different units, because Euclidean distance weights variables with larger magnitudes more heavily. Since our data consists of a single data type on a common scale, scaling is not essential here.
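As a toy illustration of that scale sensitivity (hypothetical data, unrelated to the BRCA set): with one feature measured in the thousands, the raw-data clustering is driven almost entirely by that feature, while scaling lets both features contribute.

```r
set.seed(1)
# feature 1 on unit scale, feature 2 roughly 1000x larger
toy <- cbind(f1 = rnorm(100), f2 = rnorm(100, sd = 1000))
cl_raw    <- kmeans(toy, centers = 2)$cluster         # dominated by f2
cl_scaled <- kmeans(scale(toy), centers = 2)$cluster  # f1 and f2 weighted equally
table(cl_raw, cl_scaled)
```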

  4. Spectral clustering: center but do not scale the data

    a) Use the first 4 PCs of the centered and unscaled data and apply kmeans. Find a reasonable number of clusters using the within-cluster sum of squares and the elbow rule.
k.values <- 1:10

# function to compute total within-cluster sum of square 
wss <- function(df, k) {
  kmeans(df, k, nstart = 10)$tot.withinss
}

# extract wss for 1:10 clusters
# cluster the patients' PC scores, not the gene loadings
wss_values <- map_dbl(k.values, function(k) wss(pc_score, k))
plot(k.values, wss_values,
       type="b", pch = 19, frame = FALSE, 
       xlab="Number of clusters K",
       ylab="Total within-clusters sum of squares")

The elbow rule agrees that 4 is a good number of clusters.

b) Choose an optimal cluster number and apply kmeans. Compare the real sub-type and the clustering label as follows: plot a scatter plot of PC1 vs PC2. Use point color to indicate the true cancer type and point shape to indicate the clustering label. Plot the kmeans centroids as black dots. Summarize how good the clustering results are compared to the real sub-types.
p2 <- data.table(x = pc_score[,1], 
                y = pc_score[,2],
                col = as.factor(brca_subtype),
                cl = as.factor(kmean_ret$cluster)) %>%
  ggplot() + 
  geom_point(aes(x = x, y = y, col = col, shape = cl)) +
  scale_color_manual(labels = c("Basal", "Her2", "LumA", "LumB"),
                     values = scales::hue_pal()(4)) +
  scale_shape_manual(labels = c("A", "B", "C", "D"),
                     values = c(1, 2, 3, 4)) + 
  theme_bw() +
  labs(color = "Cancer type", shape = "Cluster") +
  xlab("PC1") +
  ylab("PC2")  + 
  geom_point(aes(x=kmean_ret$centers[1,1], y=kmean_ret$centers[1,2]), colour="black") + 
  geom_point(aes(x=kmean_ret$centers[2,1], y=kmean_ret$centers[2,2]), colour="black") + 
  geom_point(aes(x=kmean_ret$centers[3,1], y=kmean_ret$centers[3,2]), colour="black") + 
  geom_point(aes(x=kmean_ret$centers[4,1], y=kmean_ret$centers[4,2]), colour="black")
p2

Compared to the real sub-types, the clustering recovers one sub-type (Basal, which is concentrated in a single cluster) well, but the other three sub-types are very hard to distinguish.

c) Compare the clustering result from applying kmeans to the original data and the clustering result from applying kmeans to 4 PCs. Does PCA help in kmeans clustering? What might be the reasons if PCA helps?
table(brca_subtype, kmean_ret$cluster)
##             
## brca_subtype   1   2   3   4
##        Basal   3  17 187   1
##        Her2   27   9  26  29
##        LumA  161  91   0 376
##        LumB  108  22   2 101

Compared to the results from applying kmeans to the original data, PCA helps somewhat: the 4 clusters become more distinguishable. PCA reduces the dimensionality of the data while retaining most of its variance, which means that kmeans takes less time to train. In addition, dropping the low-variance directions removes some of the noise that would otherwise distort the Euclidean distances kmeans relies on, so performance can improve.
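The speed benefit is easy to check directly (a sketch, assuming `brca_sub` and the low-dimensional `pc_score` matrix from above):

```r
system.time(kmeans(brca_sub, 4))   # kmeans on all 19,669 genes
system.time(kmeans(pc_score, 4))   # kmeans on the PC scores: far fewer dimensions
```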

d) Now we have a patient x with breast cancer but with unknown sub-type, along with this patient's mRNA sequencing data. Project patient x onto the space of PC1 and PC2. (Hint: remember we removed genes with no counts or no variability, took the log, and centered.) Plot this patient as a black dot in the plot from b). Calculate the Euclidean distance between this patient and each cluster centroid. Can you tell which sub-type this patient might have?
#unscaled data
pca_unscaled <- prcomp(brca_sub, center = T, scale. = F)
pca_unscaled$rotation<- pca_unscaled$rotation[, 1:20]
pca_unscaled$x <- pca_unscaled$x[, 1:20]
pve_unscaled <- summary(pca_unscaled)$importance[2, 1:10]

kmeans_unscaled <- kmeans(x = pca_unscaled$x[,1:4], 4) 

df_unscaled<-(cbind.data.frame(PC1=pca_unscaled$x[,1],
                                PC2=pca_unscaled$x[,2],
                                brca_subtype,
                                cluster=as.factor(kmeans_unscaled$cluster)))

x_patient <- fread("data/brca_x_patient.csv")

dim(x_patient)
## [1]     1 19947
# keep the same genes that were kept in brca_sub
sel_cols <- colnames(brca_sub)
x_patient_sub <- x_patient[, sel_cols, with=F]
dim(x_patient_sub)
## [1]     1 19669
# log
x_patient_sub <- log2(as.matrix(x_patient_sub)+1e-10)

# center using the column means from the training data; calling scale() on a
# single row would subtract the row's own values and zero everything out
x_patient_sub <- x_patient_sub - pca_unscaled$center

pc1_loadings <- pca_unscaled$rotation[,1]
pc2_loadings <- pca_unscaled$rotation[,2]

x_pc1 <- sum(pc1_loadings * as.numeric(x_patient_sub))
x_pc2 <- sum(pc2_loadings * as.numeric(x_patient_sub))
ggplot(df_unscaled, aes(x=PC1,y=PC2,col=brca_subtype,shape=cluster))+
         geom_point()+
  geom_point(x=x_pc1,y=x_pc2,size=10)

table(brca_subtype, kmeans_unscaled$cluster)
##             
## brca_subtype   1   2   3   4
##        Basal 187   1   3  17
##        Her2   28  27  27   9
##        LumA    0 376 161  91
##        LumB    2  99 110  22
centroid1_dist<-sqrt((kmeans_unscaled$centers[1,1]-x_pc1)^2 + (kmeans_unscaled$centers[1,2]-x_pc2)^2)
centroid2_dist<-sqrt((kmeans_unscaled$centers[2,1]-x_pc1)^2 + (kmeans_unscaled$centers[2,2]-x_pc2)^2)
centroid3_dist<-sqrt((kmeans_unscaled$centers[3,1]-x_pc1)^2 + (kmeans_unscaled$centers[3,2]-x_pc2)^2)
centroid4_dist<-sqrt((kmeans_unscaled$centers[4,1]-x_pc1)^2 + (kmeans_unscaled$centers[4,2]-x_pc2)^2)

dists<-rbind(c("cluster1 dist","cluster2 dist","cluster3 dist","cluster4 dist"),
             c(centroid1_dist,centroid2_dist,centroid3_dist,centroid4_dist))

dists
##      [,1]               [,2]               [,3]              
## [1,] "cluster1 dist"    "cluster2 dist"    "cluster3 dist"   
## [2,] "304.376612038529" "89.4657826434346" "256.553961436401"
##      [,4]              
## [1,] "cluster4 dist"   
## [2,] "290.787367635422"

The new patient is closest to cluster 2, which is most likely to be the LumA subtype.

3 Case study 3: Auto data set

This question utilizes the Auto dataset from ISLR. The original dataset contains 408 observations about cars. It is similar to the CARS dataset that we use in our lectures. To get the data, first install the package ISLR; the Auto dataset should then be loaded automatically. We’ll use this dataset to practice the methods learned so far. The original data source is here: https://archive.ics.uci.edu/ml/datasets/auto+mpg

Get familiar with this dataset first. Tip: you can use the command ?ISLR::Auto to view a description of the dataset.

# load libraries
library(ISLR)     # provides the Auto dataset
library(ggplot2)
library(GGally)

?ISLR::Auto
autoraw <- Auto[, 1:6]  # numeric columns, used later for the correlation matrix
dim(Auto)
## [1] 392   9
str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
head(Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500
summary(Auto)
##       mpg         cylinders     displacement   horsepower        weight    
##  Min.   : 9.0   Min.   :3.00   Min.   : 68   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.0   1st Qu.:4.00   1st Qu.:105   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.8   Median :4.00   Median :151   Median : 93.5   Median :2804  
##  Mean   :23.4   Mean   :5.47   Mean   :194   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.0   3rd Qu.:8.00   3rd Qu.:276   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.6   Max.   :8.00   Max.   :455   Max.   :230.0   Max.   :5140  
##                                                                            
##   acceleration       year        origin                     name    
##  Min.   : 8.0   Min.   :70   Min.   :1.00   amc matador       :  5  
##  1st Qu.:13.8   1st Qu.:73   1st Qu.:1.00   ford pinto        :  5  
##  Median :15.5   Median :76   Median :1.00   toyota corolla    :  5  
##  Mean   :15.5   Mean   :76   Mean   :1.58   amc gremlin       :  4  
##  3rd Qu.:17.0   3rd Qu.:79   3rd Qu.:2.00   amc hornet        :  4  
##  Max.   :24.8   Max.   :82   Max.   :3.00   chevrolet chevette:  4  
##                                             (Other)           :365

3.1 EDA

Explore the data, with particular focus on pairwise plots and summary statistics. Briefly summarize your findings and any peculiarities in the data.

pairs(Auto)

Auto$originf <- as.factor(Auto$origin)
Auto$yearf <- as.factor(Auto$year)
Auto$cylf <- as.factor(Auto$cylinders)


#Auto["originf"][Auto["originf"] == 1] <- "American"
#Auto["originf"][Auto["originf"] == 2] <- "European"
#Auto["originf"][Auto["originf"] == 3] <- "Japanese"


Auto %>% ggplot() + aes(x = weight, y = mpg, col = originf) + geom_point() +
geom_smooth(method='lm', formula= y~x)

Auto %>% ggplot() + aes(x = year, y = mpg, col = originf) + geom_point() +
geom_smooth(method='lm', formula= y~x)

Auto %>% ggplot() + aes(x = horsepower, y = mpg, col = originf) + geom_point() +
geom_smooth(method='lm', formula= y~x)

Auto %>%
ggplot(aes(x=weight, y=mpg, group = originf)) +
geom_point()+
geom_smooth(method="lm", formula=y~x, se=F,color = "red")+
facet_wrap(~origin) +
theme_bw()

Auto %>%
ggplot(aes(x=weight, y=mpg, group = year, col = originf)) +
geom_point()+
geom_smooth(method="lm", formula=y~x, se=F,color = "black")+
facet_wrap(~year) +
theme_bw()

Auto %>%
  ggplot(aes(x = yearf, y = mpg, fill = originf)) +
  geom_boxplot()

Auto %>%
  ggplot(aes(x = cylf, y = mpg, fill = originf)) + 
  geom_boxplot()

From the above plots, mpg appears to increase with year and decrease with weight, displacement, cylinders, and horsepower. However, closer inspection shows that many of these predictors are strongly correlated with one another.

Differences between the three origins persist even when other influences are isolated. Therefore, when building models, we will keep origin and work to eliminate predictors that are highly correlated or uninformative.
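Since GGally is already loaded, a compact pairwise overview colored by origin can also be drawn (a sketch using the `originf` factor created above):

```r
GGally::ggpairs(Auto[, c("mpg", "weight", "horsepower", "year", "originf")],
                mapping = aes(color = originf, alpha = 0.5))
```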

3.2 What effect does time have on MPG?

  i. Start with a simple regression of mpg vs. year and report R’s summary output. Is year a significant variable at the .05 level? State what effect year has on mpg, if any, according to this model.
fit1 <- lm(mpg ~ year, data = Auto)    
ggplot(Auto, aes(x = year , y = mpg)) + 
  geom_point() +
  geom_smooth(method="lm",se=F) + 
geom_hline(aes(yintercept = mean(mpg)), color = "red") 

summary(fit1) 
## 
## Call:
## lm(formula = mpg ~ year, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.021  -5.441  -0.441   4.974  18.209 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -70.0117     6.6452   -10.5   <2e-16 ***
## year          1.2300     0.0874    14.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.36 on 390 degrees of freedom
## Multiple R-squared:  0.337,  Adjusted R-squared:  0.335 
## F-statistic:  198 on 1 and 390 DF,  p-value: <2e-16

From fit1, the year variable has a p-value near 0, so it is significant at the 0.05 level. This is consistent with the plot: the red horizontal line at the mean mpg shows no trend, while the fitted line shows a clear upward trend.

The estimated coefficient for year is 1.23, so according to this model each additional year is associated with an average increase of 1.23 mpg.

  ii. Add horsepower on top of the variable year to your linear model. Is year still a significant variable at the .05 level? Give a precise interpretation of the year’s effect found here.
fit2 <- lm(mpg ~ year + horsepower, data = Auto)
summary(fit2)
## 
## Call:
## lm(formula = mpg ~ year + horsepower, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.077  -3.078  -0.431   2.588  15.315 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12.73917    5.34903   -2.38    0.018 *  
## year          0.65727    0.06626    9.92   <2e-16 ***
## horsepower   -0.13165    0.00634  -20.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.39 on 389 degrees of freedom
## Multiple R-squared:  0.685,  Adjusted R-squared:  0.684 
## F-statistic:  424 on 2 and 389 DF,  p-value: <2e-16

Year remains statistically significant at the 0.05 level, but its coefficient decreased by nearly half after horsepower was added.

The interpretation of the year coefficient is now: holding horsepower constant, each additional year is associated with an average increase of 0.66 mpg.

  iii. The two 95% CI’s for the coefficient of year differ between (i) and (ii). How would you explain the difference to a non-statistician?

The 95% CI for (i) is beta_year = 1.2300 ± 2(0.0874); the 95% CI for (ii) is beta_year = 0.65727 ± 2(0.06626).

This difference arises because the second model includes another variable, horsepower, which captures additional information. Since we now predict based on both horsepower and year, we don’t have to rely on year alone. With more information, the unexplained variance shrinks and the confidence interval on the year coefficient tightens. Without horsepower, the interval had to be wider because the model had less information to work with.
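These back-of-the-envelope intervals can be confirmed with `confint()` (a sketch, assuming `fit1` and `fit2` from above):

```r
confint(fit1, "year", level = 0.95)  # wider interval: year alone
confint(fit2, "year", level = 0.95)  # narrower interval: year plus horsepower
```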

  iv. Create a model with interaction by fitting lm(mpg ~ year * horsepower). Is the interaction effect significant at the .05 level? Explain the year effect (if any).
fit3 <- lm(mpg ~ year * horsepower, data = Auto)
summary(fit3)
## 
## Call:
## lm(formula = mpg ~ year * horsepower, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.349  -2.451  -0.456   2.406  14.444 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1.27e+02   1.21e+01  -10.45   <2e-16 ***
## year             2.19e+00   1.61e-01   13.59   <2e-16 ***
## horsepower       1.05e+00   1.15e-01    9.06   <2e-16 ***
## year:horsepower -1.60e-02   1.56e-03  -10.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.9 on 388 degrees of freedom
## Multiple R-squared:  0.752,  Adjusted R-squared:  0.75 
## F-statistic:  393 on 3 and 388 DF,  p-value: <2e-16

The interaction term is significant at the 0.05 level. With the interaction, the effect of year is two-fold: it depends on horsepower. The marginal effect of one additional year on mpg is 2.19 − 0.016 × horsepower, so the year effect is largest for low-horsepower cars and shrinks as horsepower increases.
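One way to see this is to evaluate the implied marginal effect of year at a few horsepower values (a sketch, assuming `fit3` from above; the horsepower values are arbitrary):

```r
b <- coef(fit3)
hp <- c(75, 100, 150, 200)  # hypothetical horsepower values
# d(mpg)/d(year) = beta_year + beta_interaction * horsepower
b["year"] + b["year:horsepower"] * hp
```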

3.3 Categorical predictors

Remember that the same variable can play different roles! Take a quick look at the variable cylinders, and try to use this variable in the following analyses wisely. We all agree that a larger number of cylinders will lower mpg. However, we can interpret cylinders as either a continuous (numeric) variable or a categorical variable.

  a) Fit a model that treats cylinders as a continuous/numeric variable. Is cylinders significant at the 0.01 level? What effect does cylinders play in this model?
fit4 <- lm(mpg ~ cylinders, data = Auto)
summary(fit4)
## 
## Call:
## lm(formula = mpg ~ cylinders, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.241  -3.183  -0.633   2.549  17.917 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   42.916      0.835    51.4   <2e-16 ***
## cylinders     -3.558      0.146   -24.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.91 on 390 degrees of freedom
## Multiple R-squared:  0.605,  Adjusted R-squared:  0.604 
## F-statistic:  597 on 1 and 390 DF,  p-value: <2e-16

Treating cylinders as a continuous variable results in cylinders being significant at the 0.01 level. Each additional cylinder decreases mpg by 3.558, with a (purely theoretical) 0-cylinder car getting 42.916 mpg. The intercept is not easily interpreted, but the fit does show a clear negative relationship.

  b) Fit a model that treats cylinders as a categorical/factor variable. Is cylinders significant at the .01 level? What is the effect of cylinders in this model? Describe the effect of cylinders on mpg.
fit5 <- lm(mpg ~ cylf, data = Auto)
summary(fit5)
## 
## Call:
## lm(formula = mpg ~ cylf, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.284  -2.904  -0.963   2.344  18.027 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   20.550      2.349    8.75  < 2e-16 ***
## cylf4          8.734      2.373    3.68  0.00027 ***
## cylf5          6.817      3.589    1.90  0.05825 .  
## cylf6         -0.577      2.405   -0.24  0.81071    
## cylf8         -5.587      2.395   -2.33  0.02015 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.7 on 387 degrees of freedom
## Multiple R-squared:  0.641,  Adjusted R-squared:  0.638 
## F-statistic:  173 on 4 and 387 DF,  p-value: <2e-16

Now only cylf4 is significant at the 0.01 level; the other categories do not reach that threshold. To interpret this model, the intercept is the average mpg for a three-cylinder car, and each coefficient is the difference in average mpg between that category and three-cylinder cars. The only coefficient that is statistically significant at the 0.01 level is cylf4.

  c) What are the fundamental differences between treating cylinders as a continuous and as a categorical variable in your models?

When treating cylinders as a continuous variable, the model estimates a single slope, so every additional cylinder is assumed to change mpg by the same amount; in other words, the model fits a straight line.

There are two problems with this. First, there are no 1-, 2-, or 7-cylinder cars in the data, so cylinders is not really a continuous variable. The second, more important issue is that going from 3 to 4 cylinders likely has a different effect on mpg than going from 7 to 8.

  d) Can you test the null hypothesis fit0 (mpg is linear in cylinders) vs. fit1 (mpg relates to cylinders as a categorical variable) at the .01 level?
anova(fit4, fit5)
## Analysis of Variance Table
## 
## Model 1: mpg ~ cylinders
## Model 2: mpg ~ cylf
##   Res.Df  RSS Df Sum of Sq    F  Pr(>F)    
## 1    390 9416                              
## 2    387 8544  3       871 13.2 3.4e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value of the ANOVA test is well below 0.01 (about 3.4e-08), we reject the null hypothesis that mpg is linear in cylinders in favor of treating cylinders as a categorical variable.

3.4 Results

Final modeling question: we want to explore the effects of each feature as best as possible. You may explore interactions, feature transformations, higher order terms, or other strategies within reason. The model(s) should be as parsimonious (simple) as possible unless the gain in accuracy is significant from your point of view.

res <- cor(autoraw)
round(res, 2)
##                mpg cylinders displacement horsepower weight acceleration
## mpg           1.00     -0.78        -0.81      -0.78  -0.83         0.42
## cylinders    -0.78      1.00         0.95       0.84   0.90        -0.50
## displacement -0.81      0.95         1.00       0.90   0.93        -0.54
## horsepower   -0.78      0.84         0.90       1.00   0.86        -0.69
## weight       -0.83      0.90         0.93       0.86   1.00        -0.42
## acceleration  0.42     -0.50        -0.54      -0.69  -0.42         1.00
knitr::include_graphics("occam.png")

From the correlation matrix, we see that displacement and cylinders are strongly correlated (0.95), so the two probably don’t provide much information on top of each other. Horsepower is also highly correlated with displacement (0.90), as is weight (0.93).

Following the philosophy of Occam’s razor, “plurality should not be posited without necessity,” we will compare a model using all inputs against models that remove both insignificant variables and variables that are highly correlated with others.

Furthermore, to preserve interpretability and avoid possible over-fitting, we will largely avoid higher-order and interaction terms.

model1 <- lm(mpg ~ cylf + displacement + horsepower + weight +
           acceleration + year + originf, data = Auto)
summary(model1) ### key output
## 
## Call:
## lm(formula = mpg ~ cylf + displacement + horsepower + weight + 
##     acceleration + year + originf, data = Auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.680 -1.937 -0.068  1.671 12.776 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.21e+01   4.54e+00   -4.86  1.7e-06 ***
## cylf4         6.72e+00   1.65e+00    4.06  5.9e-05 ***
## cylf5         7.08e+00   2.52e+00    2.81   0.0052 ** 
## cylf6         3.35e+00   1.82e+00    1.84   0.0670 .  
## cylf8         5.10e+00   2.11e+00    2.42   0.0161 *  
## displacement  1.87e-02   7.22e-03    2.59   0.0100 ** 
## horsepower   -3.49e-02   1.32e-02   -2.64   0.0087 ** 
## weight       -5.78e-03   6.31e-04   -9.15  < 2e-16 ***
## acceleration  2.60e-02   9.30e-02    0.28   0.7802    
## year          7.37e-01   4.89e-02   15.06  < 2e-16 ***
## originf2      1.76e+00   5.51e-01    3.20   0.0015 ** 
## originf3      2.62e+00   5.27e-01    4.96  1.0e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.1 on 380 degrees of freedom
## Multiple R-squared:  0.847,  Adjusted R-squared:  0.842 
## F-statistic:  191 on 11 and 380 DF,  p-value: <2e-16
model2 <- lm(mpg ~ horsepower + weight +
           acceleration + year + originf, data = Auto)
summary(model2) ### key output
## 
## Call:
## lm(formula = mpg ~ horsepower + weight + acceleration + year + 
##     originf, data = Auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.492 -2.090 -0.008  1.832 13.450 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.80e+01   4.63e+00   -3.89  0.00012 ***
## horsepower   -4.75e-03   1.32e-02   -0.36  0.71869    
## weight       -5.66e-03   5.03e-04  -11.25  < 2e-16 ***
## acceleration  3.92e-02   9.77e-02    0.40  0.68857    
## year          7.55e-01   5.17e-02   14.62  < 2e-16 ***
## originf2      1.94e+00   5.21e-01    3.72  0.00023 ***
## originf3      2.27e+00   5.26e-01    4.31    2e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.34 on 385 degrees of freedom
## Multiple R-squared:  0.819,  Adjusted R-squared:  0.817 
## F-statistic:  291 on 6 and 385 DF,  p-value: <2e-16
model3 <- lm(mpg ~ weight + year + originf, data = Auto)
summary(model3) ### key output
## 
## Call:
## lm(formula = mpg ~ weight + year + originf, data = Auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.603 -2.113 -0.021  1.762 13.526 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -18.30694    4.01724   -4.56  7.0e-06 ***
## weight       -0.00589    0.00026  -22.65  < 2e-16 ***
## year          0.76985    0.04867   15.82  < 2e-16 ***
## originf2      1.97631    0.51797    3.82  0.00016 ***
## originf3      2.21453    0.51882    4.27  2.5e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.34 on 387 degrees of freedom
## Multiple R-squared:  0.819,  Adjusted R-squared:  0.817 
## F-statistic:  438 on 4 and 387 DF,  p-value: <2e-16
model4 <- lm(mpg ~ weight * year + originf, data = Auto)
AIC(model1, model2, model3, model4)
##        df  AIC
## model1 13 2013
## model2  8 2067
## model3  6 2064
## model4  7 2010
BIC(model1, model2, model3, model4)
##        df  BIC
## model1 13 2064
## model2  8 2099
## model3  6 2088
## model4  7 2038
  a) Describe the final model. Include diagnostic plots, with particular focus on the model residuals and diagnoses.

For parsimony, we select model3 with just three variables: weight, year, and origin. From further inspection of the residual plots, there appears to be limited gain from the complexity of model1 relative to model3.

We also considered model4, which adds a weight * year interaction, the story being that fuel-efficiency gains over time may differ for heavier cars. Although its AIC and BIC are the lowest, inspection of the plots below suggests the practical benefit over model3 is limited.

res3 <- resid(model3)
plot(fitted(model3), res3)
abline(0,0)

res1 <- resid(model1)
plot(fitted(model1), res1)
abline(0,0)

res4 <- resid(model4)
plot(fitted(model4), res4)
abline(0,0)

qqnorm(res3)
qqline(res3) 

qqnorm(res4)
qqline(res4) 

From the above plots, the residuals appear non-normal in the upper tail: the upper tail of the Q-Q plot is heavy. The residual plot for model3 also suggests greater variance at higher fitted values.

The residuals appear roughly homoskedastic over the middle range of fitted values, with some heteroskedasticity among the outlying observations at both tails.
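As a numeric supplement to this visual impression, one could run a Breusch-Pagan test on model3. This is a sketch, not part of the original analysis, and it assumes the lmtest package is installed (it is not used elsewhere in this document):

```r
# Hypothetical supplement (requires the lmtest package): the Breusch-Pagan
# test regresses the squared residuals on the predictors; a small p-value
# indicates heteroskedasticity.
library(lmtest)
bptest(model3)
```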

model4, with the interaction term, shows a very similar residual pattern, so the simpler model3 remains our choice.

  1. Summarize the effects found.

With the selected model, model3, the effects found are as follows:

mpg increases with respect to year; mpg decreases with additional weight.

Cars made in Europe get a slight boost relative to American cars; cars made in Japan get a slight boost relative to American cars.

  1. Predict the mpg of the following car: A red car built in the US in 1983 that is 180 inches long, has eight cylinders, displaces 350 cu. inches, weighs 4000 pounds, and has a horsepower of 260. Also give a 95% CI for your prediction.
newcar <-  Auto[1, ]  # Create a new row with the same structure as Auto
newcar[1] <- NA       # Blank the response (mpg); assign predictor values below
newcar$origin <- 1    # Note: model3 uses originf, inherited from row 1 (also American)
newcar$cylinders <- 8
newcar$displacement <- 350
newcar$horsepower <- 260
newcar$weight <- 4000
newcar$year <- 83

predict(model3, newcar, interval = "prediction", se.fit = TRUE)
## $fit
##   fit  lwr  upr
## 1  22 15.4 28.7
## 
## $se.fit
## [1] 0.484
## 
## $df
## [1] 387
## 
## $residual.scale
## [1] 3.34
predict(model3, newcar, interval = "confidence", se.fit = TRUE)
## $fit
##   fit  lwr upr
## 1  22 21.1  23
## 
## $se.fit
## [1] 0.484
## 
## $df
## [1] 387
## 
## $residual.scale
## [1] 3.34

From the above results, the predicted fuel efficiency is 22 mpg, with a 95% prediction interval of 15.4 to 28.7 mpg.

The 95% confidence interval for the mean response is 21.1 to 23 mpg.
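Both intervals can be reconstructed by hand from the printed predict() output: the confidence interval uses se.fit alone, while the prediction interval also adds the residual variance. A sketch using the rounded numbers above (fit = 22, se.fit = 0.484, residual scale = 3.34, df = 387):

```r
# Reconstruct the intervals reported by predict() above
fit    <- 22      # point estimate (mpg)
se_fit <- 0.484   # standard error of the fitted mean
sigma  <- 3.34    # residual standard error
t_star <- qt(0.975, df = 387)

ci <- fit + c(-1, 1) * t_star * se_fit                   # mean response CI
pi <- fit + c(-1, 1) * t_star * sqrt(se_fit^2 + sigma^2) # prediction interval
ci   # about (21.05, 22.95), matching the printed CI up to rounding
pi   # about (15.4, 28.6)
```

The prediction interval is much wider because it accounts for the variability of an individual car around the mean, not just the uncertainty in the estimated mean.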

4 Simple Regression through simulations

4.1 Linear model through simulations

This exercise is designed to help you understand the linear model using simulations. In this exercise, we will generate \((x_i, y_i)\) pairs so that all linear model assumptions are met.

Presume that \(\mathbf{x}\) and \(\mathbf{y}\) are linearly related with a normal error \(\boldsymbol{\varepsilon}\), such that \(\mathbf{y} = 1 + 1.2\mathbf{x} + \boldsymbol{\varepsilon}\). The standard deviation of the error \(\varepsilon_i\) is \(\sigma = 2\).

We can create a sample input vector (\(n = 40\)) for \(\mathbf{x}\) with the following code:

# Generates a vector of size 40 with equally spaced values between 0 and 1, inclusive
x <- seq(0, 1, length = 40)

4.1.1 Generate data

Create a corresponding output vector for \(\mathbf{y}\) according to the equation given above. Use set.seed(1). Then, create a scatterplot with \((x_i, y_i)\) pairs. Base R plotting is acceptable, but if you can, please attempt to use ggplot2 to create the plot. Make sure to have clear labels and sensible titles on your plots.
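One possible solution sketch (variable names and plot labels are our own choices):

```r
library(ggplot2)

set.seed(1)
x <- seq(0, 1, length = 40)
# True model: beta0 = 1, beta1 = 1.2, sigma = 2
y <- 1 + 1.2 * x + rnorm(40, sd = 2)

ggplot(data.frame(x = x, y = y), aes(x = x, y = y)) +
  geom_point() +
  labs(title = "Simulated data from y = 1 + 1.2x + error (sigma = 2)",
       x = "x", y = "y")
```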

4.1.2 Understand the model

  1. Find the LS estimates of \(\boldsymbol{\beta}_0\) and \(\boldsymbol{\beta}_1\), using the lm() function. What are the true values of \(\boldsymbol{\beta}_0\) and \(\boldsymbol{\beta}_1\)? Do the estimates look reasonable?

  2. What is your RSE for this linear model fit? Is it close to \(\sigma = 2\)?

  3. What is the 95% confidence interval for \(\boldsymbol{\beta}_1\)? Does this confidence interval capture the true \(\boldsymbol{\beta}_1\)?

  4. Overlay the LS estimates and the true lines of the mean function onto a copy of the scatterplot you made above.
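A sketch of these four steps, assuming the data were generated as in 4.1.1 (set.seed(1) makes the numbers reproducible):

```r
set.seed(1)
x <- seq(0, 1, length = 40)
y <- 1 + 1.2 * x + rnorm(40, sd = 2)   # data as in 4.1.1

fit <- lm(y ~ x)
coef(fit)             # LS estimates; the true values are beta0 = 1, beta1 = 1.2
sigma(fit)            # RSE; compare with the true sigma = 2
confint(fit)["x", ]   # 95% CI for beta1; does it contain 1.2?

# Overlay the LS fit (solid) and the true mean line (dashed)
plot(x, y, main = "LS estimate vs. true line", xlab = "x", ylab = "y")
abline(fit, col = "blue")
abline(a = 1, b = 1.2, lty = 2, col = "red")
```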

4.1.3 Diagnoses

  1. Provide a residual plot where the fitted \(\mathbf{y}\)-values are on the x-axis and the residuals are on the y-axis.

  2. Provide a normal QQ plot of the residuals.

  3. Comment on how well the model assumptions are met for the sample you used.

4.2 Understand sampling distribution and confidence intervals

This part aims to help you understand the notion of sampling statistics and confidence intervals. Let’s concentrate on estimating the slope only.

Generate 100 samples of size \(n = 40\), and estimate the slope coefficient from each sample. We include some sample code below, which should guide you in setting up the simulation. Note: this code is easier to follow but suboptimal; see the appendix for a more idiomatic, vectorized way to run this simulation in R.

# Initializing variables. Note b1, se, upper_ci, lower_ci are vectors
x <- seq(0, 1, length = 40) 
n_sim <- 100              # number of simulations
b1 <- 0                   # n_sim many LS estimates of beta_1 (=1.2). Initialize to 0 for now
se <- 0                   # n_sim many standard errors of b1. Initialize to 0 for now
upper_ci <- 0             # upper bound for beta_1. Initialize to 0 for now.
lower_ci <- 0             # lower bound for beta_1. Initialize to 0 for now.
t_star <- qt(0.975, 38)   # Food for thought: why 38 instead of 40? What is t_star?

# Perform the simulation
for (i in 1:n_sim){
  y <- 1 + 1.2 * x + rnorm(40, sd = 2)
  lse <- lm(y ~ x)
  lse_output <- summary(lse)$coefficients
  se[i] <- lse_output[2, 2]   # store each se so results gets the right column
  b1[i] <- lse_output[2, 1]
  upper_ci[i] <- b1[i] + t_star * se[i]
  lower_ci[i] <- b1[i] - t_star * se[i]
}
results <- as.data.frame(cbind(se, b1, upper_ci, lower_ci))

# remove unnecessary variables from our workspace
rm(se, b1, upper_ci, lower_ci, x, n_sim, t_star, lse, lse_output)
  1. Summarize the LS estimates of \(\boldsymbol{\beta}_1\) (stored in results$b1). Does the sampling distribution agree with theory?

  2. How many of your 95% confidence intervals capture the true \(\boldsymbol{\beta}_1\)? Display your confidence intervals graphically.
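A sketch of the coverage count and a graphical display, assuming the results data frame produced by the simulation loop above:

```r
# Count how many of the 100 intervals contain the true slope beta1 = 1.2
covered <- results$lower_ci <= 1.2 & results$upper_ci >= 1.2
sum(covered)   # expect roughly 95 of 100

# Draw all 100 intervals; intervals missing the true value are shown in red
plot(results$b1, seq_len(nrow(results)), pch = 20,
     xlim = range(results$lower_ci, results$upper_ci),
     xlab = "beta1 estimate with 95% CI", ylab = "simulation index")
segments(results$lower_ci, seq_len(nrow(results)),
         results$upper_ci, seq_len(nrow(results)),
         col = ifelse(covered, "grey", "red"))
abline(v = 1.2, lty = 2)
```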